Final Group Project: AirBnB Listings Cape_Town
Executive Summary
This report explores a dataset coming from AirBnB containing information on property listings in Cape Town, South Africa. The aim of this reports is to establish a multiple regression model capable of predicting the price for 2 people staying for 4 nights based on a series of factors.
We will begin by exploring the data itself to understand its composition and underlying factors. Next, we cleaned and improved the data making it more manageable by simplifying certain categorical variables and dealing with missing values in the variables we deemed useful. Similarly, we removed variables we considered as irrelevant for the purposes of the model.
We also created a map with geo taggings of price, property type and attached url and ran a series of visualizations to gain an understanding of the variables included in the dataset before testing and running regressions.
Finally we conducted a series of regression tests on varying factors as well as applying a series of transofrmations in order to find the best model possible, before finally predicting the price for 4 nights for 2 people based on a series of predetermined inputs. Diagnostics have been conducted throughout the regression section as we laid out mobel by model. After coming up with a final model we left some suggestions for improvement and further analysis.
Data Preparation
We start by loading packages.
Exploratory Data Analysis
Load the dataset
First, we will load the data set from an URL provided by our professor.
listings <- vroom("http://data.insideairbnb.com/south-africa/wc/cape-town/2020-06-21/data/listings.csv.gz") %>%
clean_names()
Raw values
To get an understanding of the data, we will look at the raw values with the glimpse function.
dplyr::glimpse(listings)
## Rows: 24,062
## Columns: 106
## $ id <dbl> 3191, 15007, 15068, 1507…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-23, 2020-06-24,…
## $ name <chr> "Malleson Garden Cottage…
## $ summary <chr> "This is a lovely, separ…
## $ space <chr> "Beautiful self catering…
## $ description <chr> "This is a lovely, separ…
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "Mowbray is on the South…
## $ notes <chr> "There are pets on the p…
## $ transit <chr> "The train station is ab…
## $ access <chr> "Fully self-contained co…
## $ interaction <chr> "I live in the main hous…
## $ house_rules <chr> "There is no smoking in …
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 3754, 59072, 59318, 5934…
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Brigitte", "Dirk", "Lin…
## $ host_since <date> 2008-10-21, 2009-12-01,…
## $ host_location <chr> "Cape Town, Western Cape…
## $ host_about <chr> "I'm single and love to …
## $ host_response_time <chr> "N/A", "within a day", "…
## $ host_response_rate <chr> "N/A", "100%", "N/A", "1…
## $ host_acceptance_rate <chr> "100%", "100%", "33%", "…
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_listings_count <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_total_listings_count <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_verifications <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ street <chr> "Southern Suburbs, Weste…
## $ neighbourhood <lgl> NA, NA, NA, NA, NA, NA, …
## $ neighbourhood_cleansed <chr> "Ward 57", "Ward 23", "W…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Southern Suburbs", "Cap…
## $ state <chr> "Western Cape", "Western…
## $ zipcode <dbl> 7700, 7441, 7441, 7441, …
## $ market <chr> "Cape Town", "Cape Town"…
## $ smart_location <chr> "Southern Suburbs, South…
## $ country_code <chr> "ZA", "ZA", "ZA", "ZA", …
## $ country <chr> "South Africa", "South A…
## $ latitude <dbl> -33.9, -33.8, -33.8, -33…
## $ longitude <dbl> 18.5, 18.5, 18.5, 18.5, …
## $ is_location_exact <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ property_type <chr> "Guesthouse", "House", "…
## $ room_type <chr> "Entire home/apt", "Enti…
## $ accommodates <dbl> 2, 6, 6, 2, 2, 2, 4, 3, …
## $ bathrooms <dbl> 1.0, 3.0, 2.0, 1.5, 1.0,…
## $ bedrooms <dbl> 1, 3, 3, 1, 1, 1, 2, 1, …
## $ beds <dbl> 1, 4, 5, 2, 1, 1, 4, 1, …
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{TV,\"Cable TV\",Wifi,\…
## $ square_feet <dbl> NA, NA, NA, NA, NA, NA, …
## $ price <chr> "$815.00", "$2,038.00", …
## $ weekly_price <chr> "$5,204.00", "$9,000.00"…
## $ monthly_price <chr> "$20,816.00", "$40,000.0…
## $ security_deposit <chr> NA, "$2,500.00", "$3,000…
## $ cleaning_fee <chr> "$260.00", "$850.00", "$…
## $ guests_included <dbl> 2, 6, 1, 2, 2, 3, 2, 1, …
## $ extra_people <chr> "$0.00", "$250.00", "$0.…
## $ minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ minimum_minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ minimum_maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ maximum_maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ minimum_nights_avg_ntm <dbl> 3.0, 2.0, 4.0, 5.0, 2.0,…
## $ maximum_nights_avg_ntm <dbl> 730, 120, 730, 1125, 730…
## $ calendar_updated <chr> "7 months ago", "4 month…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 30, 0, 30, 0, 30, 30, 0,…
## $ availability_60 <dbl> 60, 0, 60, 0, 60, 60, 0,…
## $ availability_90 <dbl> 90, 0, 90, 0, 90, 90, 0,…
## $ availability_365 <dbl> 365, 0, 365, 234, 365, 3…
## $ calendar_last_scraped <date> 2020-06-23, 2020-06-24,…
## $ number_of_reviews <dbl> 54, 27, 0, 5, 0, 2, 8, 3…
## $ number_of_reviews_ltm <dbl> 6, 1, 0, 0, 0, 0, 1, 0, …
## $ first_review <date> 2013-05-31, 2013-12-15,…
## $ last_review <date> 2020-03-07, 2019-08-18,…
## $ review_scores_rating <dbl> 97, 97, NA, 100, NA, 90,…
## $ review_scores_accuracy <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_cleanliness <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_checkin <dbl> 10, 10, NA, 10, NA, 9, 1…
## $ review_scores_communication <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_location <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_value <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <dbl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> TRUE, FALSE, FALSE, FALS…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "moderate", "strict_14_w…
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 1, 4, 10, 5, 1, 1, 3, 1,…
## $ calculated_host_listings_count_entire_homes <dbl> 1, 4, 10, 1, 1, 1, 2, 1,…
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 4, 0, 0, 0, 0, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 0.63, 0.34, NA, 0.06, NA…
We find that the dataset has 24,602 observations (rows) with 106 variables (columns).
When looking at the data set as a whole, we can see that there are many different data types, such as characters chr, numeric doubles dbl, logicals lgl and dates date. Nevertheless, not all variables seem to be stored in the correct variable type. Let’s have a closer look.
We find the following 37 variables correctly stored as doubles:
host_listings_counthost_total_listings_countzipcodelatitudelongitudeaccommodatesbathroomsbedroomsbedssquare_feetguests_includedminimum_nightsmaximum_nightsminimum_minimum_nightsminimum_maximum_nightsmaximum_maximum_nightsmaximum_minimum_nightsminimum_nights_avg_ntmmaximum_nights_avg_ntmavailability_30availability_60availability_90availability_365number_of_reviewsnumber_of reviews_ltmreview_scores_ratingreview_scores_accuracyreview_scores_cleanlinessreview_scores_checkinreview_scores_communicationreview_scores_locationreview_scores_valuecalculated_host_listings_countcalculated_host_listings_count_entire_homescalculated_host_listings_count_private_roomscalculated_host_listings_count_shared_roomsreviews_per_month
We found the following 3 variables stored as doubles but believe they would be better stored as characters as they function more like names rather than numbers.
idscrape_idhost_id
As these variables will however not be relevant for our further anaylsis, we will leave them untouched when cleaning the data set.
We did, however, find the following 8 variables that are wrongly stored and must be stored as numeric variables:
pricehost_response_ratehost_acceptance_rateweekly_pricemonthly_pricesecurity_depositcleaning_feeextra_people
We will change these data types by cleaning the data.
As price is currently a character instead of a numeric variable, we want to change its variable type in the following. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number.
While we are unsure about the currency, we will assume it’s in USD and not the local currency Rand.
listings <- listings %>%
mutate(price = parse_number(price))
#confirm that price is now stored as a number
typeof(listings$price)
## [1] "double"
Next, we change the rest of the variables listed above to numeric. We will keep in mind that the currency of all those is in dollars as we will eliminate the character “$” from the value.
#change all to numerics
listings <- listings %>%
mutate(weekly_price = parse_number(weekly_price))
listings <- listings %>%
mutate(monthly_price = parse_number(monthly_price))
listings <- listings %>%
mutate(security_deposit = parse_number(security_deposit))
listings <- listings %>%
mutate(cleaning_fee = parse_number(cleaning_fee))
listings <- listings %>%
mutate(extra_people = parse_number(extra_people))
#check all
typeof(listings$weekly_price)
## [1] "double"
typeof(listings$monthly_price)
## [1] "double"
typeof(listings$security_deposit)
## [1] "double"
typeof(listings$cleaning_fee)
## [1] "double"
typeof(listings$extra_people)
## [1] "double"
Lastly, we change the rates into numerics, keeping in mind that they are in percentages as we will eliminate the “%” sign from the value.
#change all to numerics
listings <- listings %>%
mutate(host_response_rate = parse_number(host_response_rate))
listings <- listings %>%
mutate(host_acceptance_rate = parse_number(host_acceptance_rate))
#check all
typeof(listings$host_response_rate)
## [1] "double"
typeof(listings$host_acceptance_rate)
## [1] "double"
While there are multiple variables in the dataset, here is a quick description of some of the relevant variables collected, with cost data typically expressed in US$. We will focus our analysis on these.
price: cost per night
cleaning_fee: cleaning fee
extra_people: charge for having more than 1 personproperty_type: type of accommodation (House, Apartment, etc.)room_type:- Entire home/apt (guests have entire place to themselves)
- Private room (Guests have private room to sleep, all other rooms shared)
- Shared room (Guests sleep in room shared with others)
number_of_reviews: Total number of reviews for the listingreview_scores_rating: Average review score (0 - 100)longitude , latitude: geographical coordinates to help us locate the listingneighbourhood: three variables on a few major neighbourhoods in each city
Summary Statistics
Skim
In order to compute summary statistics of our variables and to find NAs, we will now run both the favstats() and the skim() function.
skimr::skim(listings)
| Name | listings |
| Number of rows | 24062 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 35 |
| Date | 5 |
| logical | 17 |
| numeric | 49 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 24062 | 0 |
| name | 0 | 1.00 | 1 | 255 | 0 | 23472 | 0 |
| summary | 788 | 0.97 | 1 | 1000 | 0 | 21864 | 0 |
| space | 6737 | 0.72 | 1 | 1000 | 0 | 16201 | 0 |
| description | 537 | 0.98 | 1 | 1000 | 0 | 22699 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 7661 | 0.68 | 1 | 1000 | 0 | 14099 | 0 |
| notes | 13302 | 0.45 | 1 | 1000 | 0 | 9081 | 0 |
| transit | 8467 | 0.65 | 1 | 1000 | 0 | 13469 | 0 |
| access | 10271 | 0.57 | 1 | 1000 | 0 | 11566 | 0 |
| interaction | 9120 | 0.62 | 1 | 1000 | 0 | 12194 | 0 |
| house_rules | 12005 | 0.50 | 1 | 1000 | 0 | 10309 | 0 |
| picture_url | 0 | 1.00 | 80 | 146 | 0 | 23716 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 14826 | 0 |
| host_name | 2 | 1.00 | 1 | 35 | 0 | 5400 | 0 |
| host_location | 125 | 0.99 | 2 | 254 | 0 | 610 | 0 |
| host_about | 11128 | 0.54 | 1 | 3208 | 0 | 7090 | 18 |
| host_response_time | 2 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_thumbnail_url | 2 | 1.00 | 55 | 106 | 0 | 14761 | 0 |
| host_picture_url | 2 | 1.00 | 57 | 109 | 0 | 14761 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 315 | 0 |
| street | 0 | 1.00 | 16 | 67 | 0 | 441 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 6 | 8 | 0 | 99 | 0 |
| city | 54 | 1.00 | 1 | 39 | 0 | 259 | 0 |
| state | 401 | 0.98 | 2 | 26 | 0 | 56 | 0 |
| market | 30 | 1.00 | 4 | 21 | 0 | 4 | 0 |
| smart_location | 0 | 1.00 | 12 | 53 | 0 | 293 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 2 | 0 |
| country | 0 | 1.00 | 5 | 12 | 0 | 2 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 42 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1672 | 0 | 22932 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 73 | 0 |
| cancellation_policy | 0 | 1.00 | 6 | 27 | 0 | 9 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-21 | 2020-06-25 | 2020-06-23 | 5 |
| host_since | 2 | 1.00 | 2008-08-11 | 2020-06-17 | 2016-03-12 | 2999 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-21 | 2020-06-25 | 2020-06-23 | 5 |
| first_review | 6883 | 0.71 | 2010-06-15 | 2020-06-22 | 2018-03-31 | 2226 |
| last_review | 6883 | 0.71 | 2013-04-19 | 2020-06-22 | 2020-02-04 | 1407 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 24062 | 0 | NaN | : |
| medium_url | 24062 | 0 | NaN | : |
| xl_picture_url | 24062 | 0 | NaN | : |
| host_is_superhost | 2 | 1 | 0.23 | FAL: 18493, TRU: 5567 |
| host_neighbourhood | 24062 | 0 | NaN | : |
| host_has_profile_pic | 2 | 1 | 1.00 | TRU: 23984, FAL: 76 |
| host_identity_verified | 2 | 1 | 0.26 | FAL: 17792, TRU: 6268 |
| neighbourhood | 24062 | 0 | NaN | : |
| neighbourhood_group_cleansed | 24062 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 19060, FAL: 5002 |
| has_availability | 0 | 1 | 1.00 | TRU: 24062 |
| requires_license | 0 | 1 | 0.00 | FAL: 24062 |
| jurisdiction_names | 24062 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.52 | TRU: 12426, FAL: 11636 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 24062 |
| require_guest_profile_picture | 0 | 1 | 0.00 | FAL: 24009, TRU: 53 |
| require_guest_phone_verification | 0 | 1 | 0.00 | FAL: 23977, TRU: 85 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.42e+07 | 1.22e+07 | 3.19e+03 | 1.46e+07 | 2.30e+07 | 3.53e+07 | 4.39e+07 | ▃▇▆▆▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 9.93e+07 | 9.15e+07 | 1.69e+03 | 2.63e+07 | 6.24e+07 | 1.55e+08 | 3.50e+08 | ▇▃▂▂▁ |
| host_response_rate | 13639 | 0.43 | 9.25e+01 | 2.13e+01 | 0.00e+00 | 1.00e+02 | 1.00e+02 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| host_acceptance_rate | 5234 | 0.78 | 8.22e+01 | 2.81e+01 | 0.00e+00 | 7.50e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| host_listings_count | 2 | 1.00 | 1.86e+01 | 1.36e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 2 | 1.00 | 1.86e+01 | 1.36e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| zipcode | 661 | 0.97 | 7.79e+03 | 9.96e+02 | 0.00e+00 | 7.53e+03 | 7.92e+03 | 8.00e+03 | 8.01e+04 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | -3.40e+01 | 1.00e-01 | -3.43e+01 | -3.40e+01 | -3.39e+01 | -3.39e+01 | -3.35e+01 | ▂▃▇▁▁ |
| longitude | 0 | 1.00 | 1.85e+01 | 1.30e-01 | 1.83e+01 | 1.84e+01 | 1.84e+01 | 1.85e+01 | 1.89e+01 | ▇▅▁▁▁ |
| accommodates | 0 | 1.00 | 3.89e+00 | 2.46e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 5.00e+00 | 4.00e+01 | ▇▁▁▁▁ |
| bathrooms | 15 | 1.00 | 1.74e+00 | 1.23e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bedrooms | 27 | 1.00 | 1.88e+00 | 1.43e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| beds | 107 | 1.00 | 2.40e+00 | 2.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| square_feet | 23995 | 0.00 | 9.27e+02 | 1.32e+03 | 0.00e+00 | 2.60e+02 | 5.92e+02 | 1.04e+03 | 6.46e+03 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 2.47e+03 | 7.53e+03 | 0.00e+00 | 6.59e+02 | 1.06e+03 | 2.00e+03 | 3.00e+05 | ▇▁▁▁▁ |
| weekly_price | 22826 | 0.05 | 1.98e+04 | 7.25e+04 | 6.20e+01 | 3.57e+03 | 6.26e+03 | 1.20e+04 | 1.30e+06 | ▇▁▁▁▁ |
| monthly_price | 22987 | 0.04 | 5.98e+04 | 2.15e+05 | 2.52e+02 | 1.14e+04 | 1.92e+04 | 3.48e+04 | 2.77e+06 | ▇▁▁▁▁ |
| security_deposit | 8559 | 0.64 | 2.77e+03 | 6.32e+03 | 0.00e+00 | 0.00e+00 | 1.48e+03 | 3.00e+03 | 1.50e+05 | ▇▁▁▁▁ |
| cleaning_fee | 7049 | 0.71 | 3.38e+02 | 5.12e+02 | 0.00e+00 | 1.00e+02 | 3.00e+02 | 4.50e+02 | 4.00e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.62e+00 | 1.44e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.40e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 8.38e+01 | 2.20e+02 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 9.65e+01 | 4.68e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.35e+00 | 1.44e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 1.13e+03 | 6.45e+04 | 1.00e+00 | 6.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+07 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 4.06e+00 | 1.39e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 5.03e+00 | 1.51e+01 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 5.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 1.24e+03 | 6.45e+04 | 1.00e+00 | 1.60e+02 | 1.12e+03 | 1.12e+03 | 1.00e+07 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 1.25e+03 | 6.45e+04 | 1.00e+00 | 1.80e+02 | 1.12e+03 | 1.12e+03 | 1.00e+07 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.36e+00 | 1.44e+01 | 1.00e+00 | 1.30e+00 | 2.00e+00 | 4.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 1.25e+03 | 6.45e+04 | 1.00e+00 | 1.80e+02 | 1.12e+03 | 1.12e+03 | 1.00e+07 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.78e+01 | 1.35e+01 | 0.00e+00 | 0.00e+00 | 2.70e+01 | 3.00e+01 | 3.00e+01 | ▆▁▁▁▇ |
| availability_60 | 0 | 1.00 | 3.71e+01 | 2.67e+01 | 0.00e+00 | 0.00e+00 | 5.50e+01 | 5.90e+01 | 6.00e+01 | ▅▁▁▁▇ |
| availability_90 | 0 | 1.00 | 5.71e+01 | 3.96e+01 | 0.00e+00 | 0.00e+00 | 8.30e+01 | 8.90e+01 | 9.00e+01 | ▅▁▁▁▇ |
| availability_365 | 0 | 1.00 | 2.03e+02 | 1.42e+02 | 0.00e+00 | 6.10e+01 | 2.06e+02 | 3.57e+02 | 3.65e+02 | ▅▂▃▂▇ |
| number_of_reviews | 0 | 1.00 | 1.44e+01 | 2.94e+01 | 0.00e+00 | 0.00e+00 | 3.00e+00 | 1.40e+01 | 4.25e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 4.51e+00 | 8.67e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.27e+02 | ▇▁▁▁▁ |
| review_scores_rating | 7314 | 0.70 | 9.46e+01 | 8.87e+00 | 2.00e+01 | 9.30e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| review_scores_accuracy | 7334 | 0.70 | 9.63e+00 | 9.00e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 7329 | 0.70 | 9.54e+00 | 9.40e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_checkin | 7336 | 0.70 | 9.75e+00 | 7.80e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 7332 | 0.70 | 9.75e+00 | 7.90e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 7336 | 0.70 | 9.72e+00 | 7.30e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 7341 | 0.69 | 9.50e+00 | 9.30e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| license | 24037 | 0.00 | 4.36e+09 | 3.10e+08 | 4.01e+09 | 4.03e+09 | 4.32e+09 | 4.66e+09 | 4.74e+09 | ▇▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.29e+01 | 5.04e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 3.86e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.16e+01 | 4.86e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.72e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 8.40e-01 | 2.06e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.40e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 3.00e-02 | 4.40e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.20e+01 | ▇▁▁▁▁ |
| reviews_per_month | 6883 | 0.71 | 7.20e-01 | 8.90e-01 | 1.00e-02 | 1.40e-01 | 3.60e-01 | 9.80e-01 | 8.64e+00 | ▇▁▁▁▁ |
We find that the original data set includes: - 43 character variables - 41 numeric variables - 17 logical variables - 5 date variables
(before data cleaning done in previous section)
After the first data cleaning in the section above, we find:
- 35 character variables
- 49 numeric variables
- 17 logical variables
- 5 date variables
MISSING VALUES
This is how many missing values we find within our relevant variables:
price: 0 missingextra_people: 0 missingproperty_type: 0 missingroom_type: 0 missingnumber_of_reviews: 0 missinglongitude , latitude: both 0 missingneighbourhood_cleansed: 0 missingcleaning_fee: 7,049 missingreview_scores_rating: 7,314 missing
Note: We will go with price as relevant price variable as it indicates the price per night and the other price variables (weekly_price and monthly_price) have several missing values.
As a missing value for cleaning_fee highly likely indicates, that there is no cleaning fee, we will convert all NAs to 0.
We already changed the cleaning_fee data type to numeric in the previous section and will now substitute all NAs with 0s as described.
#replace all NAs with 0
listings <- listings %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee
))
#skimr::skim(listings)
# find no more missing values for cleaning_fee
In order to still capture the observations that have a missing review_scores_rating but don’t skew the average scores rating, we will calculate the mean of all scores (except NAs) and then convert all NAs into that mean score.
#calculate the mean of all scores except NAs
avg_review_scores_rating <- listings %>%
# pull the review_scores_rating as a list
pull(review_scores_rating) %>%
mean(na.rm = TRUE)
#convert NAs to the mean score
listings <- listings %>%
mutate(review_scores_rating = case_when(
is.na(review_scores_rating) ~ avg_review_scores_rating,
TRUE ~ review_scores_rating
))
#skimr::skim(listings)
# find no more missing values for review_scores_rating
Note: Variables with 100% missing values will be eliminated later)
PROPERTY TYPE
Looking at property_type, we can use the count function to determine how many categories there are their frequency.
#count property types
listings %>%
count(property_type) %>%
arrange(desc(n)) #arrange in descending order of count
| property_type | n |
|---|---|
| Apartment | 9670 |
| House | 7030 |
| Guest suite | 1488 |
| Villa | 1051 |
| Guesthouse | 994 |
| Bed and breakfast | 613 |
| Condominium | 590 |
| Townhouse | 582 |
| Serviced apartment | 465 |
| Cottage | 449 |
| Loft | 317 |
| Boutique hotel | 166 |
| Bungalow | 139 |
| Other | 99 |
| Hostel | 92 |
| Farm stay | 56 |
| Chalet | 48 |
| Tiny house | 34 |
| Cabin | 31 |
| Aparthotel | 30 |
| Hotel | 28 |
| Nature lodge | 14 |
| Camper/RV | 12 |
| Earth house | 12 |
| Dome house | 8 |
| Resort | 7 |
| Boat | 6 |
| Dorm | 5 |
| Hut | 4 |
| Tipi | 4 |
| Barn | 3 |
| Casa particular (Cuba) | 2 |
| Castle | 2 |
| Lighthouse | 2 |
| Treehouse | 2 |
| Bus | 1 |
| Cave | 1 |
| Heritage hotel (India) | 1 |
| Island | 1 |
| Tent | 1 |
| Vacation home | 1 |
| Yurt | 1 |
We find that the top 4 property types are “Apartment” with 9,670 listings, “House” with 7,030 listings, “Guest suite” with 1,488 listings and “Villa” with 1,051 listings. Together these top 4 property types concern 19,239 listings, meaning they account for 80% (19,239 / 24,062) of all listings and are, therefore, most relevant for our analysis.
Since the vast majority of the observations in the data are one of the top four property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other.
#mutate to 5 types
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","House", "Guest suite","Villa") ~ property_type,
TRUE ~ "Other"
))
#check prop_type_simplified compared to property_type
listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
| property_type | prop_type_simplified | n |
|---|---|---|
| Apartment | Apartment | 9670 |
| House | House | 7030 |
| Guest suite | Guest suite | 1488 |
| Villa | Villa | 1051 |
| Guesthouse | Other | 994 |
| Bed and breakfast | Other | 613 |
| Condominium | Other | 590 |
| Townhouse | Other | 582 |
| Serviced apartment | Other | 465 |
| Cottage | Other | 449 |
| Loft | Other | 317 |
| Boutique hotel | Other | 166 |
| Bungalow | Other | 139 |
| Other | Other | 99 |
| Hostel | Other | 92 |
| Farm stay | Other | 56 |
| Chalet | Other | 48 |
| Tiny house | Other | 34 |
| Cabin | Other | 31 |
| Aparthotel | Other | 30 |
| Hotel | Other | 28 |
| Nature lodge | Other | 14 |
| Camper/RV | Other | 12 |
| Earth house | Other | 12 |
| Dome house | Other | 8 |
| Resort | Other | 7 |
| Boat | Other | 6 |
| Dorm | Other | 5 |
| Hut | Other | 4 |
| Tipi | Other | 4 |
| Barn | Other | 3 |
| Casa particular (Cuba) | Other | 2 |
| Castle | Other | 2 |
| Lighthouse | Other | 2 |
| Treehouse | Other | 2 |
| Bus | Other | 1 |
| Cave | Other | 1 |
| Heritage hotel (India) | Other | 1 |
| Island | Other | 1 |
| Tent | Other | 1 |
| Vacation home | Other | 1 |
| Yurt | Other | 1 |
#check only prop_type_simplified
listings %>%
count(prop_type_simplified) %>%
arrange(desc(n))
| prop_type_simplified | n |
|---|---|
| Apartment | 9670 |
| House | 7030 |
| Other | 4823 |
| Guest suite | 1488 |
| Villa | 1051 |
MINIMUM NIGHTS
As Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels, we only want to include listings in our regression analysis that are intended for travel purposes.
#count values for minimum_nights
listings %>%
count(minimum_nights) %>%
arrange(desc(n)) #arrange in descending order of count to find most common values
| minimum_nights | n |
|---|---|
| 2 | 7436 |
| 1 | 6894 |
| 3 | 3527 |
| 7 | 1614 |
| 5 | 1489 |
| 4 | 1098 |
| 10 | 508 |
| 14 | 337 |
| 20 | 252 |
| 6 | 244 |
| 30 | 182 |
| 8 | 62 |
| 90 | 59 |
| 28 | 50 |
| 12 | 47 |
| 15 | 36 |
| 21 | 35 |
| 60 | 33 |
| 180 | 27 |
| 9 | 17 |
| 13 | 12 |
| 31 | 11 |
| 25 | 9 |
| 100 | 8 |
| 120 | 7 |
| 29 | 6 |
| 365 | 6 |
| 11 | 5 |
| 18 | 3 |
| 27 | 3 |
| 45 | 3 |
| 93 | 3 |
| 150 | 3 |
| 360 | 3 |
| 16 | 2 |
| 19 | 2 |
| 22 | 2 |
| 24 | 2 |
| 32 | 2 |
| 95 | 2 |
| 168 | 2 |
| 183 | 2 |
| 185 | 2 |
| 210 | 2 |
| 26 | 1 |
| 33 | 1 |
| 40 | 1 |
| 42 | 1 |
| 89 | 1 |
| 91 | 1 |
| 94 | 1 |
| 124 | 1 |
| 182 | 1 |
| 186 | 1 |
| 240 | 1 |
| 300 | 1 |
| 1.12e+03 | 1 |
The top 3 most common minimum days of stay required by the host are 2 nights with 7,436 observations, 1 night with 6,894 observations and 3 nights with 3,527 observations. Number 4 is 7 nights minimum stay with 1,614 observations.
This clearly shows that hosts are either fine with their guests doing a short-stay around 1-3 days or want them to stay at least a week at 7 days minimum stay.
Ranking place number 7, 8 and 9 are 10, 14 and 20 days respectively. With all this information, it is easy to see that hosts are either okay with short-term stays with frequently new guests and also frequently cleaning and tidying the space, whereas other hosts only want guests for long-term stays (e.g. 1-2 weeks). This is most probably the reason because only long-term stays are profitable for these hosts or they don’t want to tidy the space every second.
To analyse the data, we filter the data set so that it only includes observations with minimum_nights <= 4
#filter minimum nights <= 4
listings <- listings %>%
filter(minimum_nights <= 4)
#check
listings %>%
count(minimum_nights) %>%
arrange(desc(n))
| minimum_nights | n |
|---|---|
| 2 | 7436 |
| 1 | 6894 |
| 3 | 3527 |
| 4 | 1098 |
NEIGHBOURHOODS
We also find that there are several different neighbourhood variables, namely:
neigborhood_overview: gives a descriptionneighbourhood: all NAsneighbourhood_cleansed: indicates which ward listing is inneighbourhood_group_cleansed: all NAs
We will eliminate the the two variables that are just NAs and will try to summarize the neighbourhood_cleansed variable into only a few city areas instead of so many wards. Since Cape Town has 116 wards, we decide to split them into 24 sub councils, which deal with local functions for between three and six wards.
listings <- listings %>%
#creating a new variable to take down sub council names
mutate(sub_council = case_when(
neighbourhood_cleansed %in% c("Ward 23","Ward 29", "Ward 32","Ward 104") ~ 1,
neighbourhood_cleansed %in% c("Ward 6","Ward 7", "Ward 8","Ward 101", "Ward 102", "Ward 111") ~ 2,
neighbourhood_cleansed %in% c("Ward 1","Ward 4", "Ward 5","Ward 70", "Ward 107", "Ward 113") ~ 3,
neighbourhood_cleansed %in% c("Ward 25","Ward 26", "Ward 27","Ward 28", "Ward 30") ~ 4,
neighbourhood_cleansed %in% c("Ward 13","Ward 20", "Ward 24","Ward 31", "Ward 50", "Ward 106") ~ 5,
neighbourhood_cleansed %in% c("Ward 2","Ward 3", "Ward 9","Ward 10", "Ward 12", "Ward 22") ~ 6,
neighbourhood_cleansed %in% c("Ward 21","Ward 103", "Ward 105","Ward 112") ~ 7,
neighbourhood_cleansed %in% c("Ward 83","Ward 85", "Ward 86","Ward 100") ~ 8,
neighbourhood_cleansed %in% c("Ward 18","Ward 87", "Ward 89","Ward 90", "Ward 91", "Ward 116") ~ 9,
neighbourhood_cleansed %in% c("Ward 92","Ward 93", "Ward 94","Ward 97", "Ward 98", "Ward 99") ~ 10,
neighbourhood_cleansed %in% c("Ward 40","Ward 44", "Ward 46","Ward 47") ~ 11,
neighbourhood_cleansed %in% c("Ward 78","Ward 79", "Ward 81","Ward 82") ~ 12,
neighbourhood_cleansed %in% c("Ward 34","Ward 35", "Ward 36","Ward 80", "Ward 88") ~ 13,
neighbourhood_cleansed %in% c("Ward 37","Ward 38", "Ward 39","Ward 41", "Ward 42", "Ward 45") ~ 14,
neighbourhood_cleansed %in% c("Ward 51","Ward 52", "Ward 53","Ward 55", "Ward 56") ~ 15,
neighbourhood_cleansed %in% c("Ward 54","Ward 57", "Ward 74","Ward 77", "Ward 115") ~ 16,
neighbourhood_cleansed %in% c("Ward 48","Ward 49", "Ward 60") ~ 17,
neighbourhood_cleansed %in% c("Ward 63","Ward 65", "Ward 66","Ward 67", "Ward 68", "Ward 110") ~ 18,
neighbourhood_cleansed %in% c("Ward 61","Ward 64", "Ward 69") ~ 19,
neighbourhood_cleansed %in% c("Ward 58","Ward 59", "Ward 62","Ward 71", "Ward 72", "Ward 73") ~ 20,
neighbourhood_cleansed %in% c("Ward 11","Ward 19", "Ward 108") ~ 21,
neighbourhood_cleansed %in% c("Ward 14","Ward 16", "Ward 17","Ward 114") ~ 22,
neighbourhood_cleansed %in% c("Ward 33","Ward 43", "Ward 75","Ward 76") ~ 23,
TRUE ~ 24
))
After consulting Cape Town experts in our group, we found that the 24 sub councils can be further categorised into 6 neighbourhood areas, which namely are Atlantic seaboard, CBD, Outer suburbs, Southern Suburbs, Cape Flats and Sommerset.
Atlantic seaboard- this term is often used by locals to describe the residential/ commercial areas facing the Atlantic seaboard and in front of table mountain, region is often characterized by luxury apartments and homes and offers many tourist attractions such a views from table mountain and beachfront restaurants
CBD-this region is generally considered the city centre and its surrounding areas characterized as the business/ commercial district of cape town.
Outer suburbs- these are regions generally away from the CBD and seaside, land mainly used for industrial purposes.
Southern Suburbs-Cape Town’s Southern Suburbs lie to the Southeast of the slopes of Table Mountain within rich valleys and vast plains reaching from just south of the Table Bay industrial neighbourhoods in the north to the False Bay coastal suburbs and the Cape Peninsula cliffs to the south, and are crossed North-South by the M3 and M5 freeways. In general, this area is identified as being the more affluent of the Cape Town Metropolis’ sections and includes the city’s most expensive residential neighbourhoods.
Cape Flats- this regions are generally characterized by their townships, this includes Mitchells Plainand and Khayelitsha and guguletho some of the regions largest townships, characterized by being very low income and run down.
Sommerset- Sommerset west is one of the largest coastal towns outside of the CBD, and offers both residential and tourist appeal away from the hussle of the city centre.
#change `sub_council` to character form since it represents a name tag
listings$sub_council <- as.character(listings$sub_council)
#summarize all wards from neighbourhood_cleansed to 6 neighbourhood areas
listings <- listings %>%
mutate(neighbourhood_simplified = case_when(
sub_council %in% c(1, 2, 7, 21) ~ "Outer Suburbs",
sub_council %in% c(3, 4, 5, 6) ~ "CBD",
sub_council %in% c(8, 22) ~ "Sommerset",
sub_council %in% c(9, 10, 12, 24) ~ "Cape Flats",
sub_council %in% c(11, 13, 14, 17, 18, 23) ~ "Southern Suburbs",
TRUE ~ "Atlantic Seaboard"
))
#check neighbourhood_simplified compared to sub_council
listings %>%
count(neighbourhood_simplified, sub_council) %>%
arrange(desc(n))
| neighbourhood_simplified | sub_council | n |
|---|---|---|
| Atlantic Seaboard | 16 | 8782 |
| Atlantic Seaboard | 19 | 2065 |
| Atlantic Seaboard | 20 | 1848 |
| CBD | 3 | 1554 |
| Outer Suburbs | 1 | 1050 |
| Atlantic Seaboard | 15 | 788 |
| Cape Flats | 24 | 719 |
| Sommerset | 8 | 670 |
| Outer Suburbs | 7 | 477 |
| CBD | 6 | 279 |
| Outer Suburbs | 2 | 169 |
| Southern Suburbs | 17 | 155 |
| Southern Suburbs | 18 | 124 |
| CBD | 4 | 66 |
| Outer Suburbs | 21 | 66 |
| Sommerset | 22 | 45 |
| Southern Suburbs | 11 | 35 |
| Southern Suburbs | 23 | 20 |
| Cape Flats | 10 | 18 |
| Cape Flats | 12 | 9 |
| Cape Flats | 9 | 9 |
| CBD | 5 | 4 |
| Southern Suburbs | 14 | 3 |
Further, we find these variables regarding location:
- 441 unique values for
street - 259 unique values for
city - 56 unique values for
state - 4 unique values for
market - 2 unique values for
country_code - 2 unique values for
country
As several values for market, country_code and country seem odd, we inspect those values further in the dataset.
We find that market includes the following values:
- “Cape Town”
- “Other International”
- “Krabi”
- “D.C.”
However, they all have “Cape Town” as values for the smart location which is why we can either disregard this column or we assume the names are just wrong so that we can change the three odd names to “Cape Town” also.
#change "Other Internatinal", "Krabi", "D.C." to "Cape Town"
listings <- listings %>%
mutate(market = case_when(
market == "Cape Town" ~ "Cape Town",
TRUE ~ "Cape Town"
))
#check
listings %>%
count(market)
| market | n |
|---|---|
| Cape Town | 18955 |
When further investigating country_code and country, we find that all observations have the value “ZA” or “South Africa” but one observation which has the values “CG” and “Congo” respectively. As this observation’s smart_location is also “Congo”, we eliminate this observation as a wrongly assigned value.
#filter the observations to ensure they contain "ZA" for 'country_code'
listings <- listings %>%
filter(country_code != "CG")
#check
listings %>%
count(country_code)
| country_code | n |
|---|---|
| ZA | 18954 |
We find that 6 values within state are called “.Western Cape” instead of “Western Cape” which we will change in the data cleansing as well.
@SR
#change ".Western Cape" to "Western Cape"
listings$state[listings$state == ".Western Cape"] <- "Western Cape"
listings$state[listings$state == "Cape town"] <- "Cape Town"
listings$state[listings$state == "Cape Town - Western Cape"] <- "Cape Town"
listings$state[listings$state == "Cape Town Western Cape"] <- "Cape Town"
#check
listings %>%
count(state)
| state | n |
|---|---|
| Bellville | 1 |
| Blouberg | 1 |
| Camps Bay | 2 |
| Cape Province | 1 |
| Cape Town | 82 |
| Cape Town, | 1 |
| CBD | 2 |
| Central Bussiness District | 2 |
| De Waterkant | 1 |
| Gardens | 2 |
| Gauteng | 1 |
| Goodwood | 1 |
| Gordons Bay | 4 |
| Green Point | 1 |
| Hout bay | 1 |
| Hout Bay | 1 |
| Muizenberg | 2 |
| Parklands | 1 |
| Rondebosch | 1 |
| SA | 2 |
| Simon's Town in Cape Town | 1 |
| Simons Town | 1 |
| Somerset West | 1 |
| South africa | 1 |
| South Africa | 21 |
| Tarifa | 1 |
| Wc | 1 |
| WC | 1095 |
| WC Hout Bay | 1 |
| We | 1 |
| Wes-Kaap | 43 |
| western cape | 3 |
| Western cape | 6 |
| Western Cape | 17274 |
| Western Cape Town | 1 |
| Western Cape, Cape Town | 1 |
| Western Cape, South Africa | 1 |
| Western Cape/ Rondebosch | 1 |
| Western Peovince | 6 |
| Western province | 1 |
| Western Province | 11 |
| Westkap | 1 |
| Wine Lands South Africa | 1 |
| Woodstock | 6 |
| 364 |
We further find that host_location has several unique values including
- “Cape Town, Western Cape, South Africa”
- “ZA”
- “Parow, Western Cape, South Africa”
listings %>%
count(host_location) %>%
arrange(desc(n))
| host_location | n |
|---|---|
| Cape Town, Western Cape, South Africa | 13442 |
| ZA | 2952 |
| South Africa | 367 |
| Western Cape, South Africa | 325 |
| Johannesburg, Gauteng, South Africa | 193 |
| 93 | |
| Stellenbosch, Western Cape, South Africa | 70 |
| London, England, United Kingdom | 69 |
| GB | 67 |
| Pretoria, Gauteng, South Africa | 62 |
| Durban, KwaZulu-Natal, South Africa | 51 |
| US | 49 |
| Sandton, Gauteng, South Africa | 48 |
| United Kingdom | 41 |
| Hamburg, Hamburg, Germany | 36 |
| KwaZulu-Natal, South Africa | 32 |
| England, United Kingdom | 29 |
| Berlin, Berlin, Germany | 23 |
| DE | 18 |
| Paarl, Western Cape, South Africa | 18 |
| Gauteng, South Africa | 17 |
| Amsterdam, North Holland, Netherlands | 15 |
| Germany | 15 |
| Knysna, Western Cape, South Africa | 14 |
| Dubai, Dubai, United Arab Emirates | 13 |
| Hong Kong | 13 |
| NL | 12 |
| Randburg, Gauteng, South Africa | 12 |
| London, United Kingdom | 11 |
| CA | 9 |
| Ibhayi, Eastern Cape, South Africa | 9 |
| Port Elizabeth, Eastern Cape, South Africa | 9 |
| Brussels, Brussels, Belgium | 8 |
| Cologne, North Rhine-Westphalia, Germany | 8 |
| Netherlands | 8 |
| Paris, Île-de-France, France | 8 |
| Welkom, Free State, South Africa | 8 |
| ES | 7 |
| Frankfurt, Hesse, Germany | 7 |
| Franschhoek, Western Cape, South Africa | 7 |
| IL | 7 |
| New York, New York, United States | 7 |
| Randfontein, Gauteng, South Africa | 7 |
| San Francisco, California, United States | 7 |
| Bern, Canton of Bern, Switzerland | 6 |
| CN | 6 |
| Copenhagen, Denmark | 6 |
| London | 6 |
| Los Angeles, California, United States | 6 |
| Vienna, Vienna, Austria | 6 |
| Zürich, Zurich, Switzerland | 6 |
| Dolphin Coast, KwaZulu-Natal, South Africa | 5 |
| Edinburgh, Scotland, United Kingdom | 5 |
| Liverpool, England, United Kingdom | 5 |
| Robertson, Western Cape, South Africa | 5 |
| Schwabach, Bavaria, Germany | 5 |
| Stockholm, Stockholm County, Sweden | 5 |
| United States | 5 |
| Barcelona, Catalonia, Spain | 4 |
| BE | 4 |
| Cape Town, South Africa | 4 |
| East London, Eastern Cape, South Africa | 4 |
| Edenvale, Gauteng, South Africa | 4 |
| FR | 4 |
| George, Western Cape, South Africa | 4 |
| Germiston, Gauteng, South Africa | 4 |
| Hermanus, Western Cape, South Africa | 4 |
| Hillcrest, KwaZulu-Natal, South Africa | 4 |
| Kimberley, Northern Cape, South Africa | 4 |
| Langebaan, Western Cape, South Africa | 4 |
| Merchtem, Flanders, Belgium | 4 |
| Midrand, Gauteng, South Africa | 4 |
| Munich, Bavaria, Germany | 4 |
| NO | 4 |
| originally from London now living the dream in Cape Town | 4 |
| Oslo, Oslo, Norway | 4 |
| Richards Bay, KwaZulu-Natal, South Africa | 4 |
| SG | 4 |
| Singapore | 4 |
| Sun City, North West, South Africa | 4 |
| Sydney, New South Wales, Australia | 4 |
| Washington, District of Columbia, United States | 4 |
| Wellington, Western Cape, South Africa | 4 |
| Windhoek, Khomas Region, Namibia | 4 |
| 1b Rontree Avenue, Camps Bay CT SA 8005 | 3 |
| AE | 3 |
| Australia | 3 |
| Colenso, KwaZulu-Natal, South Africa | 3 |
| Despatch, Eastern Cape, South Africa | 3 |
| Düsseldorf, North Rhine-Westphalia, Germany | 3 |
| Fort Lauderdale, Florida, United States | 3 |
| Grabouw, Western Cape, South Africa | 3 |
| Grünwald, Bavaria, Germany | 3 |
| Hanover, Lower Saxony, Germany | 3 |
| Hostivice, Central Bohemian Region, Czechia | 3 |
| Howick, KwaZulu-Natal, South Africa | 3 |
| Jeffreys Bay, Eastern Cape, South Africa | 3 |
| Johannesburg South, Gauteng, South Africa | 3 |
| Lichtenburg, North West, South Africa | 3 |
| Melbourne, Victoria, Australia | 3 |
| Mudjimba, Queensland, Australia | 3 |
| Plettenberg Bay, Western Cape, South Africa | 3 |
| Seattle, Washington, United States | 3 |
| Shelley, England, United Kingdom | 3 |
| Simon's Town (False Bay Coast) South Africa | 3 |
| Sir Lowry's Pass, Western Cape, South Africa | 3 |
| Soweto, Gauteng, South Africa | 3 |
| Surrey, United Kingdom | 3 |
| Umhlanga, KwaZulu-Natal, South Africa | 3 |
| Wenzhou, Zhejiang, China | 3 |
| Windmeul, Western Cape, South Africa | 3 |
| Worcester, Western Cape, South Africa | 3 |
| ZW | 3 |
| Abu Dhabi, Abu Dhabi, United Arab Emirates | 2 |
| Alberton, Gauteng, South Africa | 2 |
| Antwerp, Flanders, Belgium | 2 |
| AR | 2 |
| Arlington, Virginia, United States | 2 |
| AT | 2 |
| Auckland, Auckland, New Zealand | 2 |
| Bad Wildungen, Hesse, Germany | 2 |
| Bellingham, Washington, United States | 2 |
| Betty's Bay, Western Cape, South Africa | 2 |
| Between Cape Town and India | 2 |
| Bracknell, England, United Kingdom | 2 |
| Bredasdorp, Western Cape, South Africa | 2 |
| CD | 2 |
| Centurion, Gauteng, South Africa | 2 |
| CH | 2 |
| Claremont, California, United States | 2 |
| De Aar, Northern Cape, South Africa | 2 |
| Dortmund, North Rhine-Westphalia, Germany | 2 |
| Dublin 6, County Dublin, Ireland | 2 |
| Dublin, Dublin, Ireland | 2 |
| Fish Hoek, CPT, South Africa | 2 |
| France | 2 |
| Fuengirola, Andalucía, Spain | 2 |
| Gauteng | 2 |
| Gold Coast, Queensland, Australia | 2 |
| Harare, Harare, Zimbabwe | 2 |
| Hartbeespoort, North West, South Africa | 2 |
| Hoofddorp, North Holland, Netherlands | 2 |
| ID | 2 |
| Kloof, KwaZulu-Natal, South Africa | 2 |
| Lisbon, Lisbon, Portugal | 2 |
| Lodi, Lombardy, Italy | 2 |
| London, UK | 2 |
| Lusaka, Lusaka, Zambia | 2 |
| Manchester, England, United Kingdom | 2 |
| Margate, KwaZulu-Natal, South Africa | 2 |
| Minneapolis, Minnesota, United States | 2 |
| Monaco-Ville, Monaco | 2 |
| Morón, Ciego de Avila, Cuba | 2 |
| Mouzens, Nouvelle-Aquitaine, France | 2 |
| Nairobi, Nairobi County, Kenya | 2 |
| Norwich, England, United Kingdom | 2 |
| Nuremberg, Bavaria, Germany | 2 |
| Oxford, England, United Kingdom | 2 |
| Pandamatenga, North West, Botswana | 2 |
| Perth, Western Australia, Australia | 2 |
| Polokwane, Limpopo, South Africa | 2 |
| Riebeeck Kasteel, Western Cape, South Africa | 2 |
| RO | 2 |
| Saint Francis Bay, Eastern Cape, South Africa | 2 |
| Serramanna, Sardegna, Italy | 2 |
| Sevenoaks, England, United Kingdom | 2 |
| Seychelles | 2 |
| South of France | 2 |
| Still Bay, Western Cape, South Africa | 2 |
| Tel Aviv-Yafo, Tel Aviv District, Israel | 2 |
| Tokyo, Japan | 2 |
| Toronto, Ontario, Canada | 2 |
| UK | 2 |
| Urmston, England, United Kingdom | 2 |
| Vancouver, British Columbia, Canada | 2 |
| Vanderbijlpark, Gauteng, South Africa | 2 |
| Wessobrunn, Bavaria, Germany | 2 |
| West Coast DC, Western Cape, South Africa | 2 |
| White River, Mpumalanga, South Africa | 2 |
| Whitefish, Montana, United States | 2 |
| Williamsport, Pennsylvania, United States | 2 |
| Windhoek, Khomas, Namibia | 2 |
| Wollerau, Canton of Schwyz, Switzerland | 2 |
| 's-Gravenzande, South Holland, Netherlands | 1 |
| 305 Hillside Heights, Wessels Road, Greenpoint, 8005 | 1 |
| Alsheim, Rhineland-Palatinate, Germany | 1 |
| AM | 1 |
| Arklow, County Wicklow, Ireland | 1 |
| Assel, Grevenmacher, Luxembourg | 1 |
| Atlantic Beach, Florida, United States | 1 |
| AU | 1 |
| Austin, Texas, United States | 1 |
| Balearic Islands, Spain | 1 |
| Baltimore, Maryland, United States | 1 |
| Bangkok, Thailand | 1 |
| Bath, England, United Kingdom | 1 |
| Bay Lake, Florida, United States | 1 |
| Beaufort West, Western Cape, South Africa | 1 |
| Beauvechain, Walloon Region, Belgium | 1 |
| Bedford, Eastern Cape, South Africa | 1 |
| Belgrade, Serbia | 1 |
| Bellville, Texas, United States | 1 |
| Belmont, Massachusetts, United States | 1 |
| Berkhamsted, England, United Kingdom | 1 |
| Bethlehem, Maryland, United States | 1 |
| Between Cape Town and Berlin | 1 |
| Bilthoven, Utrecht, The Netherlands | 1 |
| Blagnac, Occitanie, France | 1 |
| Blantyre, Southern Region, Malawi | 1 |
| Bloemendaal, North Holland, Netherlands | 1 |
| Bloemfontein, Free State, South Africa | 1 |
| Bordeaux, Aquitaine-Limousin-Poitou-Charentes, France | 1 |
| Bowden, South Australia, Australia | 1 |
| Breda, North Brabant, Netherlands | 1 |
| Bremen, Bremen, Germany | 1 |
| Brest, Brittany, France | 1 |
| Brighton, England, United Kingdom | 1 |
| British Columbia, Canada | 1 |
| Brooklyn, New York, United States | 1 |
| Caddington, England, United Kingdom | 1 |
| Canmore, Alberta, Canada | 1 |
| Cape Town | 1 |
| Cape Town, Western Cape | 1 |
| Caribbean | 1 |
| Carne, County Wexford, Ireland | 1 |
| Cascais, Lisbon, Portugal | 1 |
| Cayman Islands | 1 |
| Chapel Row, England, United Kingdom | 1 |
| Charleston, South Carolina, United States | 1 |
| Charlotte, North Carolina, United States | 1 |
| Charlottesville, Virginia, United States | 1 |
| Charlton Kings, England, United Kingdom | 1 |
| Chaville, Île-de-France, France | 1 |
| Chesterfield, England, United Kingdom | 1 |
| Chiawa, Lusaka Province, Zambia | 1 |
| Chilton, England, United Kingdom | 1 |
| City of Johannesburg Metropolitan Municipality, Gauteng, South Africa | 1 |
| Clanwilliam, Western Cape, South Africa | 1 |
| Collonges-sous-Salève, Auvergne-Rhône-Alpes, France | 1 |
| Colorado Springs, Colorado, United States | 1 |
| Como, Lombardy, Italy | 1 |
| Compiègne, Nord-Pas-de-Calais-Picardie, France | 1 |
| Corcelles-près-Concise, Vaud, Switzerland | 1 |
| County Dublin, Ireland | 1 |
| County Wicklow, Ireland | 1 |
| Culver City, California, United States | 1 |
| Cumbria, United Kingdom | 1 |
| Czechia | 1 |
| Darling, Western Cape, South Africa | 1 |
| De Steeg, Gelderland, The Netherlands | 1 |
| Denmark | 1 |
| Denver, Colorado, United States | 1 |
| Desio, Lombardy, Italy | 1 |
| Deutschland | 1 |
| DK | 1 |
| Doha, Doha, Qatar | 1 |
| Dordrecht, Zuid-Holland, Netherlands | 1 |
| Dortmund, Nordrhein-Westfalen, Germany | 1 |
| Dundee, Scotland, United Kingdom | 1 |
| Durban Metro, KwaZulu-Natal, South Africa | 1 |
| Durban North, KwaZulu-Natal, South Africa | 1 |
| Durban, South Africa | 1 |
| Elimbah, Queensland, Australia | 1 |
| Emmeloord, Flevoland, Netherlands | 1 |
| Engelskirchen, North Rhine-Westphalia, Germany | 1 |
| England and South Africa | 1 |
| Exeter, England, United Kingdom | 1 |
| Farnham, England, United Kingdom | 1 |
| Fish Hoek | 1 |
| Fish Hoek, South Africa | 1 |
| Flensburg, Schleswig-Holstein, Germany | 1 |
| Florence, Tuscany, Italy | 1 |
| Fouesnant, Brittany, France | 1 |
| Franklin, Tennessee, United States | 1 |
| Freetown, Western Area, Sierra Leone | 1 |
| Frome, England, United Kingdom | 1 |
| Gaborone, South-East, Botswana | 1 |
| Galway, Galway, Ireland | 1 |
| Geneva, Canton of Geneva, Switzerland | 1 |
| Geneva, Geneva, Switzerland | 1 |
| Gilbert, Arizona, United States | 1 |
| Ginosa, Puglia, Italy | 1 |
| GR | 1 |
| Grahamstown, Eastern Cape, South Africa | 1 |
| Granada, Andalusia, Spain | 1 |
| Greyton, Western Cape, South Africa | 1 |
| Groot Brakrivier, Western Cape, South Africa | 1 |
| Gütersloh, North Rhine-Westphalia, Germany | 1 |
| Haarby, Denmark | 1 |
| Halifax, Nova Scotia, Canada | 1 |
| Harare, Harare Province, Zimbabwe | 1 |
| Hatteras, North Carolina, United States | 1 |
| Helsinki, Uusimaa, Finland | 1 |
| Herolds Bay, Western Cape, South Africa | 1 |
| Herrliberg, Canton of Zurich, Switzerland | 1 |
| Heywood, England, United Kingdom | 1 |
| Hilton, KwaZulu-Natal, South Africa | 1 |
| HK | 1 |
| Hong Kong, Hong Kong | 1 |
| Hout Bay, RSA | 1 |
| Houten, Utrecht, The Netherlands | 1 |
| Hyattsville, Maryland, United States | 1 |
| I am so fortunate to live in beautiful Cape Town with nature right on our doorstep. | 1 |
| IE | 1 |
| IN | 1 |
| Innsbruck, Tyrol, Austria | 1 |
| Ireland | 1 |
| IS | 1 |
| IT | 1 |
| Jakarta, Indonesia | 1 |
| Jersey | 1 |
| Jerusalem, Jerusalem District, Israel | 1 |
| Johannesburg, South Africa | 1 |
| JP | 1 |
| Kampala, Central Region, Uganda | 1 |
| Karlskrona, Blekinge County, Sweden | 1 |
| Karlsruhe, Baden-Württemberg, Germany | 1 |
| Katlehong, Gauteng, South Africa | 1 |
| Kelowna, British Columbia, Canada | 1 |
| Kempton Park, Gauteng, South Africa | 1 |
| Kiel, Schleswig-Holstein, Germany | 1 |
| Kinshasa, Kinshasa, Democratic Republic of the Congo | 1 |
| Kirchzarten, Baden-Württemberg, Germany | 1 |
| Kitzbuhel, Tyrol, Austria | 1 |
| Knokke-Heist, Flanders, Belgium | 1 |
| Königstein im Taunus, Hesse, Germany | 1 |
| Kreuzlingen, Thurgau, Switzerland | 1 |
| Krugersdorp, Gauteng, South Africa | 1 |
| Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia | 1 |
| Kuwait City, Al Asimah, Kuwait | 1 |
| Lagos, Nigeria | 1 |
| Langnau am Albis, Zurich, Switzerland | 1 |
| Lausanne, Vaud, Switzerland | 1 |
| Lenham, England, United Kingdom | 1 |
| Leuven, Flanders, Belgium | 1 |
| Liège, Walloon Region, Belgium | 1 |
| Limerick, Limerick, Ireland | 1 |
| Lisse, Zuid-Holland, Netherlands | 1 |
| Lobamba, Hhohho Region, Eswatini | 1 |
| London (West End, Leicester Square, Soho) | 1 |
| London, England | 1 |
| London, Ontario, Canada | 1 |
| London; UK | 1 |
| Londres, Grande-Bretagne | 1 |
| Long Crendon, England, United Kingdom | 1 |
| Loosdrecht, North Holland, Netherlands | 1 |
| LS | 1 |
| Lucerne, Lucerne, Switzerland | 1 |
| Lünen, North Rhine-Westphalia, Germany | 1 |
| Lusaka, Lusaka Province, Zambia | 1 |
| Luxembourg | 1 |
| Luxembourg City, Luxembourg, Luxembourg | 1 |
| Madrid, Community of Madrid, Spain | 1 |
| Magor, Wales, United Kingdom | 1 |
| Mahikeng, North West, South Africa | 1 |
| Málaga, Andalusia, Spain | 1 |
| Malmesbury, Western Cape, South Africa | 1 |
| Männedorf, Zurich, Switzerland | 1 |
| Marburg, Hesse, Germany | 1 |
| Marrakesh, Marrakesh-Tensift-El Haouz, Morocco | 1 |
| Maseru, Maseru, Lesotho | 1 |
| Melbourne, Florida, United States | 1 |
| Melsonby, England, United Kingdom | 1 |
| Mettmann, North Rhine-Westphalia, Germany | 1 |
| Miami, Florida, United States | 1 |
| Middelburg - MP, Mpumalanga, South Africa | 1 |
| Milan, Lombardy, Italy | 1 |
| Mill Valley, California, United States | 1 |
| Millburn, New Jersey, United States | 1 |
| Mombasa, Mombasa, Kenya | 1 |
| Montagu, Western Cape, South Africa | 1 |
| Mossel Bay, Western Cape, South Africa | 1 |
| Mountain Line Farm, The Crags, South Africa | 1 |
| Mountain View, California, United States | 1 |
| Mpumalanga, South Africa | 1 |
| Msunduzi Municipality, KwaZulu-Natal, South Africa | 1 |
| Muizenberg, Western Cape, South Africa | 1 |
| Nairobi, Nairobi, Kenya | 1 |
| Nancy, Lorraine, France | 1 |
| Neuburg am Rhein, Rhineland-Palatinate, Germany | 1 |
| Neuried, Bavaria, Germany | 1 |
| New Delhi, Delhi, India | 1 |
| New South Wales, Australia | 1 |
| Ngorongoro, Arusha, Tanzania | 1 |
| Niederbüren, St. Gallen, Switzerland | 1 |
| Nienhagen, Lower Saxony, Germany | 1 |
| Nkandla, KwaZulu-Natal, South Africa | 1 |
| Noordhoek, Fishhoek our closest town | 1 |
| Noordwijk, South Holland, Netherlands | 1 |
| North West, South Africa | 1 |
| Nottingham Road, KwaZulu-Natal, South Africa | 1 |
| Nottingham, England, United Kingdom | 1 |
| Nyon, Vaud, Switzerland | 1 |
| NZ | 1 |
| Oakland, California, United States | 1 |
| Oberschleißheim, Bavaria, Germany | 1 |
| Olathe, Kansas, United States | 1 |
| Olifantsfontein, Gauteng, South Africa | 1 |
| OM | 1 |
| Omaha, Nebraska, United States | 1 |
| Onrus, Western Cape, South Africa | 1 |
| Oosterbeek, Gelderland, Netherlands | 1 |
| Orgeval, Île-de-France, France | 1 |
| Ottawa, Ontario, Canada | 1 |
| Ottersberg, Lower Saxony, Germany | 1 |
| Oxon Hill, Maryland, United States | 1 |
| Palo Alto, California, United States | 1 |
| Parow, Western Cape, South Africa | 1 |
| Parys, Free State, South Africa | 1 |
| Patricia and I live in a large house in central Cape Town which has an apartment attached. We are a few streets from Table Mountain National Park and within walking distance of the city. But there's also a good bus service and nearby shops & restaurants. | 1 |
| Philippines | 1 |
| Pietermaritzburg, KwaZulu-Natal, South Africa | 1 |
| Pontpierre, Luxembourg District, Luxembourg | 1 |
| Poole, England, United Kingdom | 1 |
| Port Alfred, Eastern Cape, South Africa | 1 |
| Port Louis, Port Louis District, Mauritius | 1 |
| Port Saint Mary, Isle of Man | 1 |
| Portland, Maine, United States | 1 |
| Posadas, Misiones Province, Argentina | 1 |
| Potsdam, Brandenburg, Germany | 1 |
| Prague, Hlavní město Praha, Czech Republic | 1 |
| Principauté de Monaco | 1 |
| Prospect, South Australia, Australia | 1 |
| QA | 1 |
| Queensland, Australia | 1 |
| Richmond, England, United Kingdom | 1 |
| Riebeek West, South Africa | 1 |
| Rixensart, Walloon Region, Belgium | 1 |
| Rome, Lazio, Italy | 1 |
| Rotterdam, South Holland, Netherlands | 1 |
| Rotterdam, Zuid-Holland, The Netherlands | 1 |
| RU | 1 |
| Rüsselsheim, Hesse, Germany | 1 |
| SA | 1 |
| Saarbrücken, Saarland, Germany | 1 |
| Sachsen bei Ansbach, Bavaria, Germany | 1 |
| Saint Petersburg, Florida, United States | 1 |
| Saint-Lubin-des-Joncherets, Centre, France | 1 |
| Salt Spring Island, British Columbia, Canada | 1 |
| San Diego, California, United States | 1 |
| San Francisco (formerly Cape Town) | 1 |
| San Gimignano, Tuscany, Italy | 1 |
| Santa Cruz, California, United States | 1 |
| Santa Monica, California, United States | 1 |
| São Paulo, State of São Paulo, Brazil | 1 |
| Sautel, Occitanie, France | 1 |
| Schönau im Mühlkreis, Upper Austria, Austria | 1 |
| Schülp bei Rendsburg, Schleswig-Holstein, Germany | 1 |
| Seaforth, New South Wales, Australia | 1 |
| Seefeld, Bavaria, Germany | 1 |
| Sewickley, Pennsylvania, United States | 1 |
| Simon's Town is located on the shores of False Bay on the Eastern side of the Cape Peninsula, a 45 minute drive from Cape Town | 1 |
| Singapore, Republic of Singapore | 1 |
| Singapore, Singapore | 1 |
| Sliema, Malta | 1 |
| Soustons, Aquitaine-Limousin-Poitou-Charentes, France | 1 |
| South Africa | 1 |
| south africa | 1 |
| Southlake, Texas, United States | 1 |
| Springs, Gauteng, South Africa | 1 |
| Standerton, Mpumalanga, South Africa | 1 |
| Standish, England, United Kingdom | 1 |
| Starnberg, Bavaria, Germany | 1 |
| Stavanger, Rogaland, Norway | 1 |
| Stuart, Florida, United States | 1 |
| Stuttgart, Baden-Württemberg, Germany | 1 |
| Styria, Austria | 1 |
| Suamico, Wisconsin, United States | 1 |
| Sugar Land, Texas, United States | 1 |
| Swakopmund, Erongo Region, Namibia | 1 |
| Sweden | 1 |
| Tain-l'Hermitage, Rhone-Alpes, France | 1 |
| The Hague, South Holland, Netherlands | 1 |
| The Hague, Zuid-Holland, The Netherlands | 1 |
| The most beautiful city in the world. | 1 |
| The Sands, England, United Kingdom | 1 |
| This is our home | 1 |
| Tulbagh, Western Cape, South Africa | 1 |
| Umeå, Västerbotten County, Sweden | 1 |
| United Arab Emirates | 1 |
| Upington, Northern Cape, South Africa | 1 |
| USA | 1 |
| Utrecht, Utrecht, Netherlands | 1 |
| Vaalwater, Limpopo, South Africa | 1 |
| Vestby, Akershus, Norway | 1 |
| Vilnius, Vilnius County, Lithuania | 1 |
| Vredendal, Western Cape, South Africa | 1 |
| Wales, United Kingdom | 1 |
| Walvis Bay, Erongo Region, Namibia | 1 |
| Wartburg, KwaZulu-Natal, South Africa | 1 |
| Weingarten, Baden-Württemberg, Germany | 1 |
| Wellington, Wellington, New Zealand | 1 |
| West Beach, South Australia, Australia | 1 |
| West Chicago, Illinois, United States | 1 |
| West Horsley, England, United Kingdom | 1 |
| Western Cape | 1 |
| Wilderness, Western Cape, South Africa | 1 |
| Winchester, England, United Kingdom | 1 |
| Winchester, Massachusetts, United States | 1 |
| Wokingham, England, United Kingdom | 1 |
| Wolseley, Western Cape, South Africa | 1 |
| York, England, United Kingdom | 1 |
| Zell am See, Salzburg, Austria | 1 |
| ZM | 1 |
| Zug, Canton of Zug, Switzerland | 1 |
| 한국 | 1 |
CATEGORICAL VARIABLES
Lastly, we find several categorical or factor variables (numeric or character variables with variables that have a fixed and known set of possible values).
host_response_time(as specified by Airbnb, e.g. “within an hour”, “within a day”)host_response_rate(between 0% and 100%)host_acceptance_rate(between 0% and 100%)host_is_superhost(TRUE or FALSE)host_has_profile_pic(TRUE or FALSE)host_identity_verified(TRUE or FALSE)neighbourhood_cleansed(limited number of Wards)city(limited set of cities that belong to the area)state(only one state, “Western Cape”)zipcode(limited set of zip codes for the area)market(should all be “Cape Town”)country_code(should all be “ZA”)country(should all be “South Africa”)
TEXT VARIABLES
We further find these following variables which are filled with descriptive text or are just URL-addresses and will, thus, also not be relevant to our analysis.
namesummaryspacedescriptionneighborhood_overviewnotestransitaccessinteractionhouse_ruleshost_aboutlisting_urlthumbnail_url(all NAs)medium_url(all NAs)picture_urlxl_picture_url(all NAs)host_urlhost_thumbnail_urlhost_picture_url
listings_version2 <- listings %>%
#deselecting all text variables
dplyr::select(-name, -summary, -space, -description,
-neighborhood_overview, -notes, -transit, -access,
-interaction, -house_rules, -host_about, -listing_url,
-thumbnail_url, -medium_url, -picture_url, -xl_picture_url,
-host_url, -host_thumbnail_url, -host_picture_url)
MISSING VALUES 2
While some variables have 0 missing values, a lot of variables have some missing values. However, we do find 8 variables that miss all values, namely that only (or for one mainly) include NAs (Not Available). We will eliminate these when cleaning the data set.
host_neighbourhoodthumbnail_urlmedium_urlxl_picture_urlneighbourhoodneighbourhood_group_cleansedjurisdiction_nameslicense(mainly NAs except for a few irrelevant values)
listings_version3 <- listings_version2 %>%
#deselecting variables with only or mainly NAs
dplyr::select(-host_neighbourhood, -neighbourhood, -neighbourhood_group_cleansed,
-jurisdiction_names, -license)
We can also see that all observations have a value for the experiences_offered variable. As there is only 1 unique value, which is “none”, we will can discard this variable, too as it adds no value to our analysis.
listings_version3 <- listings_version3 %>%
#deselecting the column `experiences_offered`
dplyr::select(-experiences_offered)
The target variable is the cost for two people to stay at an Airbnb location for four (4) nights, namely (price+extra_people)*4. So we create a new variable price_4_nights in our dataset ‘listings_version3’.
listings_version3 <- listings_version3 %>%
#creating a new variable `price_4_nights`
mutate(price_4_nights = ifelse(guests_included < 2, 4*(price+extra_people) + cleaning_fee,
price*4+cleaning_fee))
Favstats
Lastly, we will have a look at favstats() of the numeric variables that we specified as relevant earlier:
#favstats related numerical variables to examine rough distribution/outliers
favstats(listings_version3$price)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 121 | 624 | 988 | 1.7e+03 | 3e+05 | 2.12e+03 | 7.54e+03 | 18954 | 0 |
favstats(listings_version3$price_4_nights)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 484 | 2.81e+03 | 4.21e+03 | 7.21e+03 | 1.2e+06 | 8.8e+03 | 3.02e+04 | 18954 | 0 |
favstats(listings_version3$number_of_reviews)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 17 | 425 | 16.4 | 31.2 | 18954 | 0 |
favstats(listings_version3$review_scores_rating)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 20 | 94.6 | 95 | 99 | 100 | 94.5 | 7.5 | 18954 | 0 |
favstats(listings_version3$accommodates)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 40 | 3.57 | 2.31 | 18954 | 0 |
favstats(listings_version3$longitude)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 18.3 | 18.4 | 18.4 | 18.5 | 18.9 | 18.5 | 0.126 | 18954 | 0 |
favstats(listings_version3$latitude)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| -34.3 | -34 | -33.9 | -33.9 | -33.5 | -34 | 0.0998 | 18954 | 0 |
It seems there exist fairly isolated outliers for these numeric variables.
Meanwhile, we can double check from favstate results that all of the 7 variables have 18954 counts, namely no missing values after we have selected the observations with minimum_nights <= 4.
The maximum of price is 300,009, almost 300 (300,009/988) times higher than the median. The standard deviation for price is 7,537.011, signaling high variability.
For price_4_nights , the max is 1,200,036; the mean (8801) is larger than median (4212), indicating that the distribution could be right-skewed.
For number_of_reviews, mean is similar to Q3 and the maximum is way higher than median. This is possibly because a large number of “0”s exist in the column.
For review_scores_rating, most values are within 94.5 and 99, and the mean is 94, indicating that review_scores_rating are fairly high for Cape Town properties.
For accommodates, most properties can have 2 to 4 people, while the largest accommodates 40.
For longitude(latitude), all values are clustered around “18.4”(“-34.0”). This verifies that we are getting right observations in one certain area, namely Cape Town.
Overall, the results of favstats show there are outliers that could distrupt our analysis. Thus in each visualization, we would like to limit the numerical values to its [quantile 10%, quantile 90%].
Informative visualisations
Graphs of variables of interest
Density Plot
- price
The distribution is right-skewed and prices (per night) are concentrated around $550.
plot_of_price <- listings_version3 %>%
filter(price < quantile(price,0.9), price > quantile(price,0.1)) %>%
ggplot(aes(x=price)) +
geom_density()+
labs(title="Distribution of price")+
scale_x_continuous(breaks = seq(0,4000,500))+
theme_bw()
plot_of_price

- price_4_nights
The distribution has one peak and price for 4 nights (for two people actually) are concentrated around $25000.
# remove scientific notation
options(scipen=999)
# plot
plot_of_price_4_nights <- listings_version3 %>%
filter(cleaning_fee < quantile(price_4_nights,0.9), cleaning_fee > quantile(price_4_nights,0.1)) %>%
ggplot(aes(x=price_4_nights)) +
geom_density()+
labs(title="Distribution of price_4_nights")+
scale_x_continuous(breaks = seq(0,220000,25000))+
theme_bw()
plot_of_price_4_nights

- guest_included
The distribution has multiple peaks, indicating that for the price of most properties, only 1 or 2 guests are included. It is likely that when more than two people want to book a property, they will have to pay the extra money (extra_people).
plot_of_guests_included <- listings_version3 %>%
#filter(guests_included < quantile(guests_included,0.9), guests_included > quantile(guests_included,0.1)) %>%
ggplot(aes(x=guests_included)) +
geom_density()+
labs(title="Distribution of guest included")+
scale_x_continuous(breaks = seq(0,20,1))+
theme_bw()
plot_of_guests_included

- extra_people
The distribution has four peaks, indicating that the charge for one extra person are likely to be $100, $150, $200, $250.
plot_of_extra_people <- listings_version3 %>%
filter(extra_people < quantile(extra_people,0.9), extra_people > quantile(extra_people,0.1)) %>%
ggplot(aes(x=extra_people)) +
geom_density()+
labs(title="Distribution of extra people")+
theme_bw()
plot_of_extra_people

- minimum_nights
Most properties require the stay to be around one to two nights, some require three nights. This corresponds to what we have analysed in the “2.3.1 Skim-Minimum nights” section. Hence we get rid of the observations with over 4 nights.
plot_of_minimum_nights <- listings_version3%>%
ggplot(aes(x=minimum_nights)) +
geom_density()+
labs(title="Distribution of minimum nights")+
scale_x_continuous(breaks = seq(0,5,1))+
theme_bw()
plot_of_minimum_nights

- number_of_reviews
The distribution is bi-modal, indicating that the numbers of reviews are concentrated below 5.
plot_of_number_of_reviews <- listings_version3%>%
filter(number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1)) %>%
ggplot(aes(x=number_of_reviews)) +
geom_density()+
labs(title="Distribution of number of reviews")+
scale_x_continuous(breaks = seq(0,55,10))+
theme_bw()
plot_of_number_of_reviews

- review_scores_rating
Most review scores rating is above 90, and are most concentrated around 97.5, indicating that most properties receive decent ratings.
plot_of_review_scores_rating <- listings_version3%>%
filter(review_scores_rating < quantile(review_scores_rating,0.9), review_scores_rating > quantile(review_scores_rating,0.1)) %>%
ggplot(aes(x=review_scores_rating)) +
geom_density()+
labs(title="Distribution of review scores rating")+
theme_bw()
plot_of_review_scores_rating

- host_listings_count
Most host have around 1 to 10 listings, and the peak is at 1.
plot_of_host_listings_count <- listings_version3%>%
ggplot(aes(x=host_listings_count)) +
geom_density()+
labs(title="Distribution of host listings count")+
xlim(c(0,50)) +
theme_bw()
plot_of_host_listings_count

- security_deposit
Most properties do not require security deposit: NA + 0 consist of 13182/18964 (69.5%) of the whole dataset.
plot_of_security_deposit <- listings_version3%>%
ggplot(aes(x=security_deposit)) +
geom_density()+
labs(title="Distribution of security deposit")+
xlim(c(0,10)) +
theme_bw()
plot_of_security_deposit

histogram
For histograms, we choose to plot accommodates (the maximum number of people) as it reflects beds, bathrooms, and bedrooms combined.
- accommodates
Overall, most accommodates are concentrated around 2 and 4. But the scenario varies from type to type. Because we filtered out outliers before when creating the listings_version3_filtered, we are losing information on villa properties. So we plot an additional histogram for villa accommodates.
plot_of_accommodates <- listings_version3 %>%
ggplot(aes(x=accommodates)) +
geom_histogram(binwidth = 1)+
labs(title="Distribution of accommodates")+
xlim(0,15)+
#scale_x_continuous(breaks = seq(1,14,1))+
theme_bw()
plot_of_accommodates

plot_of_accommodates_by_prop_type <- plot_of_accommodates+
facet_wrap(~prop_type_simplified,scale="free")+
labs(title="Distribution of accommodates by property type")
plot_of_accommodates_by_prop_type

Explore villa accommodates
As the histograms shows, villas accommodate more people than the rest as a large proportion of values are concentrated around 8.
listings %>%
filter(prop_type_simplified=="Villa") %>%
count()
| n |
|---|
| 710 |
plot_of_accommodates_villa <- listings_version3 %>%
filter(prop_type_simplified=="Villa") %>%
ggplot(aes(x=accommodates)) +
geom_histogram(binwidth = 1)+
labs(title="Distribution of villa accommodates")+
#xlim(0,15)+
scale_x_continuous(breaks = seq(1,20,1))+
theme_bw()
plot_of_accommodates_villa

Bar Plot
For the bar plots we are using the original dataset listings_version3 because we do not want to lose too much data.
- room_type
Most of the rooms are entire home/apartment, followed by private room.
data_of_room_type <- listings_version3%>%
group_by(room_type) %>%
summarise(count = n()) %>%
mutate(room_type = fct_reorder(room_type, desc(count)))
plot_of_room_type <- data_of_room_type %>%
ggplot(aes(x=room_type,y=count)) +
geom_bar(stat = "identity")+
geom_text(aes(label=count), vjust=1.6, color="white", size=3.5)+
theme_bw()
plot_of_room_type

- neighbourhood_simplified
data_of_neighbourhood_simplified <- listings_version3 %>%
group_by(neighbourhood_simplified) %>%
summarise(count = n()) %>%
mutate(neighbourhood_simplified = fct_reorder(neighbourhood_simplified, desc(count)))
plot_of_neighbourhood_simplified <- data_of_neighbourhood_simplified %>%
ggplot(aes(x=neighbourhood_simplified,y=count)) +
labs(title="Most properties are located in Atlantic Seaboard")+
geom_bar(stat = "identity")+
theme_bw()
plot_of_neighbourhood_simplified

- prop_type_simplified
data_of_prop_type_simplified <- listings_version3 %>%
group_by(prop_type_simplified) %>%
summarise(count = n()) %>%
mutate(prop_type_simplified= fct_reorder(prop_type_simplified,desc(count)))
plot_of_prop_type_simplified <- data_of_prop_type_simplified %>%
ggplot(aes(x=prop_type_simplified,y=count)) +
labs(title="Most properties are apartment, followed by house")+
geom_bar(stat = "identity")+
theme_bw()
plot_of_prop_type_simplified

- host_is_superhost
plot_of_host_is_superhost <- listings_version3%>%
filter(!is.na(host_is_superhost)) %>%
ggplot(aes(x=host_is_superhost)) +
labs(title="Most hosts are not superhost")+
geom_bar()+
theme_bw()
plot_of_host_is_superhost

- host_identity_verified
plot_of_host_identity_verified <- listings_version3%>%
filter(!is.na(host_identity_verified)) %>%
ggplot(aes(x=host_identity_verified)) +
labs(title="Most hosts are not verified")+
geom_bar()+
theme_bw()
plot_of_host_identity_verified

- cancellation_policy
Most properties have a cancellation policy of strict 14 days with grace period.
# check out the type of cancellation policy
typeof(listings_version3$cancellation_policy)
## [1] "character"
# make it a factor variable and prepare the data for plotting
data_of_cancellation_policy <- listings_version3 %>%
dplyr::select(cancellation_policy) %>%
mutate(cancellation_policy_f = as.factor(cancellation_policy)) %>%
group_by(cancellation_policy_f) %>%
summarise(count = n()) %>%
mutate(cancellation_policy_f= fct_reorder(cancellation_policy_f,desc(count)))
# plot
plot_of_cancellation_policy <- data_of_cancellation_policy %>%
ggplot(aes(x=cancellation_policy_f,y=count)) +
geom_bar(stat = "identity")+
theme_bw()
plot_of_cancellation_policy

Correlation matrix
To measure correlation between variables, we use correlation matrix to generate realationship plots and correlation coefficents, which measure both the strength and direction of the linear relationship between two continuous variables.
Before plotting
After inspecting the cleaned dataset, we want to focus on the main range of each variable to detect obvious relationships for further informative visualisations. So, we decide to limit values to 10%-90% percentile.
#quantile(listings_version3$price, probs = c(0.1,0.9))
#quantile(listings_version3$cleaning_fee, probs = c(0.1,0.9))
#quantile(listings_version3$extra_people, probs = c(0.1,0.9))
#quantile(listings_version3$number_of_reviews, probs = c(0.1,0.9))
#quantile(listings_version3$review_scores_rating, probs = c(0.1,0.9))
listings_version3_filtered <- listings_version3 %>%
filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1),
number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1),
review_scores_rating < quantile(review_scores_rating,0.9), review_scores_rating > quantile(review_scores_rating,0.1),
accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1))
There is one thing to notice: this listings_version3_filtered is an intersection of all the related variables within [quantile 10%, quantile 90%]. Consequently this is specifically for the production of informative plots within several variables at one time, such as “correlation matrix”.
Correlation matrix in numeric variables
# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
# listings
correlation_matrix1 <- listings_version3_filtered %>%
dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates) %>%
GGally::ggpairs(alpha = 0.4)
#load library here
library(here)
ggsave("correlation_matrix1.jpg",plot = correlation_matrix1 ,path=here::here("images"), width = 17, height = 10)
knitr::include_graphics(here::here("images", "correlation_matrix1.jpg"))

Interpreting correlations between variables
From the correlation matrix above, we can see thataccommodates and price_4_nights are positively correlated with an coefficient of 0.234.
Significant correlation also exists between number_of_reviews & review_scores_rating(0.354). number_of_reviews is significantly correlated with accomodates, sharing a correlation coefficient of -0.075** .
Correlation matrix in neighbourhood_simplified and numerical variables
# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix2 <-listings_version3_filtered %>%
dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,neighbourhood_simplified) %>%
GGally::ggpairs(aes(color = neighbourhood_simplified), alpha = 0.4)
#load library here
library(here)
ggsave("correlation_matrix2.jpg", plot = correlation_matrix2, path=here::here("images"), width = 17, height = 10)
knitr::include_graphics(here::here("images", "correlation_matrix2.jpg"))

Interpreting correlations between variables
Categorizing by neighbourhood_simplified, we can specify clear correlations between numerical variables in details.
price_4_nights is significantly correlated with number_of_reviews in “Atlantic Seaboard”(-0.071) and “CBD”(-0.276),
price is significantly correlated with review_scores_rating in “Southern Suburbs”(0.569).
The above two types of trends show that different neighbourhoods do have different levels of impact on the relationship between price_4_nights and number_of_reviews / review_scores_rating.
Take “Southern Suburbs” as an example. Cape Town’s “Southern Suburbs” is identified as being the more affluent of the Cape Town Metropolis’ sections and includes the city’s most expensive residential neighbourhoods. Hence here may exist more evident linear relationship between price_4_nights and review_scores_rating since people think of the experience of the trip and accommodation fully deserves the price. We will further look at whether this correlation appears to be conditional on the value of neighbourhood_simplified below.
From the three boxplots above, we can conclude the distribution of price_4_nights, number_of_reviews and review_scores_rating among different neighbourhood_simplified. Listings in “Atlantic seaboard” always rank ahead in the price_4_nights and number_of_reviews.
While price_4_nights and number_of_reviews of listings in “Atlantic Seaboard” are above average, review_scores_rating in this area is inconsistently low. From our knowledge, “Atlantic Seaboard” is often used by locals to describe the residential/ commercial areas facing the Atlantic seaboard and in front of table mountain. And the low rating may be because people do not think of luxury apartments as a good bargain.
Correlation matrix in prop_type_simplified and numerical variables
# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix3 <-listings_version3_filtered %>%
dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,prop_type_simplified) %>%
GGally::ggpairs(aes(color = prop_type_simplified), alpha = 0.4)
# load library here
library(here)
ggsave("correlation_matrix3.jpg", plot = correlation_matrix3, path=here::here("images"), width = 17, height = 10)
knitr::include_graphics(here::here("images", "correlation_matrix3.jpg"))

Interpreting correlations between variables
Categorizing by prop_type_simplified, we can specify clear correlations between numerical variables in details. Indeed, different types of property do have different levels of impact on the relationship between price and other variable.
price_4_nights is significantly correlated with number_of_reviews in “guest suite” (0.188), while being non-significant in other property types. Overall, price_4_nights and review_scores_rating are significantly correlated: apartment (0.196), house(0.261), other(0.229).
Regarding the relationship between review_scores_rating and number_of_reviews, there exists a significant correlation however less obvious among villas.
Now let’s delve into the property type. From the boxplots above, we can conclude the distribution of price_4_nights, number_of_reviews and review_scores_rating among different prop_type_simplified. Overall, “Guests suide” have the highest number_of_reviews and review_scores_rating although the price_4_nights for it is fairly low. This is probably because “Guests suide” has fewer data points. Meanwhile, people might think of it as the most rewarding experience with good price.
Correlation matrix in room_type and numerical variables
# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix4 <-listings_version3_filtered %>%
# filter variables of interest
dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,room_type) %>%
GGally::ggpairs(aes(color = as.factor(room_type)), alpha = 0.4)
# load library here
library(here)
ggsave("correlation_matrix4.jpg", plot = correlation_matrix4, path=here::here("images"), width = 17, height = 10)
knitr::include_graphics(here::here("images", "correlation_matrix4.jpg"))

Interpreting correlations between variables
Categorizing by prop_type_simplified, we can specify clear correlations between numerical variables in details. Indeed, different types of property do have different levels of impact on the relationship between price and other variable.
price_4_nights ’s negative correlation with number_of_reviews is only significant among entire home/apartment (-0.004), but the positive correlation with accommodates is significant among all property types except for hotel rooms. It could be that hotel rooms usually charge under their operation rules so the impact of accommodates are less significant.
Regarding the boxplots, it is evident that the price_4_nights varies from type to type, with hotel room being the most expensive. Please note that we filtered data here so the findings are just for reference.
Correlations conditional on categorical variable
Plot1
price_4_nights X number_of_reviews by neighborhood_simplified
price_4_nights_number_of_reviews_1 <- listings_version3 %>%
filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>%
filter(number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1)) %>%
ggplot2::ggplot(aes(x = number_of_reviews , y = price_4_nights))+
geom_point()+
geom_smooth()+
facet_wrap(~neighbourhood_simplified)+
labs(title = "'CBD' and 'Atlantic Seaboard' sees strong relationship between price_4_nights and number_of_reviews",
subtitle = "correlation comparison conditional on neighbourhoods",
x = "number_of_reviews")+
theme_bw()+
NULL
ggsave("price_4_nights_number_of_reviews_1.jpg", plot = price_4_nights_number_of_reviews_1, path=here::here("images"), width = 15, height = 10)
knitr::include_graphics(here::here("images", "price_4_nights_number_of_reviews_1.jpg"))
The graph above appears to be indicating correlations conditional on neighbourhoods.“Cape Flats”, “Sommerset”, and “Southern Suburbs” have fewer observation points than “Atlantic Seaboard” and “CBD”, which might be a guess answer to the weak correlations.
It can be concluded that properties in different regions attract visitors to leave reviews at different levels, though probably not that obvious considering they still belong to the same metropolitan. For instance, people who go on business trip tend not to leave a review, while leisure travellers tend to do that.
Plot2
price_4_nights X accommodates by neighborhood_simplified
price_4_nights_accommodates_byneighborhood <- listings_version3 %>%
filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>%
filter(accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1)) %>%
ggplot2::ggplot(aes(x = as.factor(accommodates), y = price_4_nights))+
geom_boxplot()+
facet_wrap(~neighbourhood_simplified)+
#scale_x_continuous(limits = c(75,300))+
#scale_y_continuous(limits = c(0,3000))+
labs(title = " 'Atlantic Seaboard' sees highest average price_4_nights across accommodates",
subtitle = "Correlation comparison conditional on neighbourhoods",
x = "accommodates")+
theme_bw()+
NULL
ggsave("price_4_nights_accommodates_byneighborhood.jpg", plot = price_4_nights_accommodates_byneighborhood, path=here::here("images"), width = 15, height = 10)
knitr::include_graphics(here::here("images", "price_4_nights_accommodates_byneighborhood.jpg"))

Plot3
price_4_nights X accommodates by prop_type_simplified
price_4_nights_accommodates_byprop<- listings_version3 %>%
filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>%
filter(accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1)) %>%
ggplot2::ggplot(aes(x = as.factor(accommodates), y = price_4_nights))+
geom_boxplot()+
facet_wrap(~prop_type_simplified)+
#scale_x_continuous(limits = c(75,300))+
#scale_y_continuous(limits = c(0,3000))+
labs(title = " 'House' sees most significant positive relationship ",
subtitle = "Correlation comparison conditional on property type",
x = "accommodates")+
theme_bw()+
NULL
ggsave("price_4_nights_accommodates_byprop.jpg", plot = price_4_nights_accommodates_byprop, path=here::here("images"), width = 15, height = 10)
knitr::include_graphics(here::here("images", "price_4_nights_accommodates_byprop.jpg"))

Mapping
Prepare the dataset for mapping
library(kableExtra)
# check out the unique values for each variable
# for property_type
unique(listings$property_type)
## [1] "Guesthouse" "House" "Apartment"
## [4] "Other" "Guest suite" "Villa"
## [7] "Serviced apartment" "Bed and breakfast" "Bungalow"
## [10] "Loft" "Cottage" "Condominium"
## [13] "Hotel" "Townhouse" "Hostel"
## [16] "Boutique hotel" "Cabin" "Chalet"
## [19] "Aparthotel" "Tiny house" "Farm stay"
## [22] "Lighthouse" "Treehouse" "Tent"
## [25] "Camper/RV" "Resort" "Nature lodge"
## [28] "Cave" "Earth house" "Hut"
## [31] "Boat" "Bus" "Vacation home"
## [34] "Castle" "Casa particular (Cuba)" "Dorm"
## [37] "Heritage hotel (India)" "Tipi" "Barn"
## [40] "Dome house"
table(listings$property_type) %>%
kbl()
| Var1 | Freq |
|---|---|
| Aparthotel | 29 |
| Apartment | 8046 |
| Barn | 2 |
| Bed and breakfast | 606 |
| Boat | 6 |
| Boutique hotel | 162 |
| Bungalow | 107 |
| Bus | 1 |
| Cabin | 27 |
| Camper/RV | 9 |
| Casa particular (Cuba) | 2 |
| Castle | 2 |
| Cave | 1 |
| Chalet | 45 |
| Condominium | 470 |
| Cottage | 385 |
| Dome house | 1 |
| Dorm | 5 |
| Earth house | 11 |
| Farm stay | 50 |
| Guest suite | 1402 |
| Guesthouse | 958 |
| Heritage hotel (India) | 1 |
| Hostel | 91 |
| Hotel | 27 |
| House | 4513 |
| Hut | 2 |
| Lighthouse | 2 |
| Loft | 281 |
| Nature lodge | 14 |
| Other | 86 |
| Resort | 5 |
| Serviced apartment | 441 |
| Tent | 1 |
| Tiny house | 32 |
| Tipi | 4 |
| Townhouse | 414 |
| Treehouse | 2 |
| Vacation home | 1 |
| Villa | 710 |
# use listings because we need the url
listings_formapping <- listings %>%
dplyr::select(longitude, latitude, listing_url, property_type,price,minimum_nights,prop_type_simplified)
# before mapping color to price, check price distribution
# check summary statistics
favstats(listings_formapping$price)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 121 | 624 | 988 | 1.7e+03 | 3e+05 | 2.12e+03 | 7.54e+03 | 18954 | 0 |
# Quantile 10% to quantile 90%
quantile(listings_formapping$price,probs = c(0.1,0.9))
## 10% 90%
## 416 3453
# filtering some data
listings_formapping%>%
ggplot(aes(x=price)) +
geom_histogram(fill='#69b3a2', color='white') +
xlab("Price per night") +
xlim(c(400,3500))+
theme_bw()

# create the dataset of properties we need to map
listings_formapping_final <- listings_formapping %>%
filter(price<=3453.4, price >= 416,minimum_nights <= 4)
Data cleanings
We found that for price per night, the distribution is right-skwed, with the 10% quantile of 416.0 and the 90% quantile of 3453.4. And since we will color the properties according to its price, we filter the outliers smaller than 416 and larger than 3453.4 and make a new dataset just for the leaflet mapping.
Actual mapping
The basic map
The variables we use are price, longitude, latitude, listing_url, and property_type.
The label shows the prop_type_simplified ; the pop up shows its price per night and the corresponding url.
# load the leaflet package
library(leaflet)
# just use price per se
# add layers indicating the property type
library(RColorBrewer)
# assign color palette
# we can do colorNumeric or colorQuantile but to display price, we choose the former
pal <- colorNumeric(
palette = "PuBuGn",
domain = c(0,3500),
n = 5
)
# create a basic map
map1 <- leaflet(data =listings_formapping_final
) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
# set the initial view
setView( lat=-34, lng=18.5 , zoom=8) %>%
# add the circles
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
label = ~prop_type_simplified) %>%
# add legends
addLegend(pal=pal, values=~price, opacity=0.9,
title = "Price per night", position = "bottomleft")
Airbnb listings in Cape Town
The map with layers of property types
For a clearer view, we use Leaflet’s layers control feature for this new mapping. We can toggle the visibility of different property types to show by clicking on the layer control.
And the properties are colored based on its price.
# check the property type simplified (we have 5 groups)
listings_formapping_final %>%
count(prop_type_simplified) %>%
arrange(desc(n))
| prop_type_simplified | n |
|---|---|
| Apartment | 7105 |
| Other | 3648 |
| House | 3083 |
| Guest suite | 1160 |
| Villa | 227 |
# create a more complicated map with layers indicating property types
map2_base <- leaflet(data =listings_formapping_final
) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
# set the initial view
setView( lat=-34, lng=18.5 , zoom=10)
# create individual datasets for each property type
m2_apartment <- listings_formapping_final %>% filter(prop_type_simplified == "Apartment")
m2_other <- listings_formapping_final %>% filter(prop_type_simplified == "Other")
m2_house <- listings_formapping_final %>% filter(prop_type_simplified == "House")
m2_guest <- listings_formapping_final %>% filter(prop_type_simplified == "Guest suite")
m2_villa <- listings_formapping_final %>% filter(prop_type_simplified == "Villa")
# add the circles for Apartment
map2_with_layers <- map2_base %>% addCircleMarkers(data = m2_apartment,
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# set the pop up to be listing url
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
# set the label to be property_type
label = ~property_type,
group = "Apartment") %>%
# add the circles for House
addCircleMarkers(data = m2_house,
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
label = ~property_type,
group = "House") %>%
# add the circles for Guest suite
addCircleMarkers(data = m2_guest,
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
label = ~property_type,
group = "Guest suite") %>%
# add the circles for Villa
addCircleMarkers(data = m2_villa,
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
label = ~property_type,
group = "Villa") %>%
# add the circles for Other
addCircleMarkers( data = m2_other,
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
# popup shows price per night and url
popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
# here we used the original property type for group other
label = ~property_type,
group = "Other") %>%
# add layer control
addLayersControl(overlayGroups = c("Apartment","House","Guest suite","Villa","Other")) %>%
# add legends
addLegend(pal=pal, values=~price, opacity=0.9,
title = "Price per night", position = "bottomleft")
Airbnb listings in Cape Town by property type
Initial Findings
Based on the map above, we observe location and price difference between various types of properties.
- For apartments, those near the coastline, for example Three Anchor Bay (belongs to Atlantic seaboard), have higher price per night.
- For houses, those located in Newlands, Bishopscoutr (belong to Southern Suburbs) are more expensive.
- For guest suites, those located in the Glen, Bakoven, and Rontree (also belong to Atlantic seaboard) cost more.
- For villas, the properties seem rather scattered around and are mostly priced above $1,500 per night.
- For properties that belong to other types, the distribution looks similar to that of apartments.
To conclude:
- In terms of location, properties near the coastlines (Atlantic seaboard) are more likely to be expensive because they have better views.
- In terms of the property type and price, houses and villas generally have higher price per night, while guest suites have lower price per night.
Regression & Model Diagnostics
Checking the underlying distribution and applying necessary transformations
We begin our regression by plotting the underlying distribution of the ‘price_4_nights’ variable to check for normality of the dependent variable price_4_nights.
ggplot(data=listings_version3, aes(x=price_4_nights))+
geom_density()+
labs(title = "Density of Price for 2 people for 4 nights at an AirBNB")+
theme_bw()

ggplot(data=listings_version3, aes(x=log(price_4_nights)))+
geom_density()+
labs(title = "Density of log transformed Price for 2 people for 4 nights at an AirBNB")+
theme_bw()

ggplot(data=listings_version3, aes(x=sqrt(price_4_nights)))+
geom_density()+
labs(title = "Density of square root transformed Price for 2 people for 4 nights at an AirBNB")+
theme_bw()

ggplot(data=listings_version3, aes(bcPower(price_4_nights,lambda=-0.25)))+
geom_density()+
labs(title = "Density of box-cox transformed Price for 2 people for 4 nights at an AirBNB")+
theme_bw()
When analyzing the underlying distribution of price for 2 people for 4 nights, we notice the underlying data is not normally distributed and running a regression will violate the normality assumption.We then attempted a number of transformations to correct this issue. First we tried a log transformation on the dependent Y (price_4_nights) which seemed to correct the distribution, while also attempting to both a square root and box cox transformation(bcPower) with lamda=-0.25, the square root did not seem to correct the non-normality, while the box-cox transformation seemed to improve the non-normality and skewness the most. For this reason we will consider 3 possible models, namely: the un-transformed ,log-transformed and box-cox transformed and discuss each of their diagnostics throughout.
library(rsample) #here we divide the data in modeling and test data
set.seed(1235)# for reproducibility set seed
model_test_split <- initial_split(listings_version3, prop=0.8) #splitting the data into a modeling set and testing set. Choosing an 80-20 split
listings_model <- training(model_test_split)#model data group
listing_test <- testing(model_test_split)#testing set
Initial model considering Property Types,Number of Reviews and Review scores as possible influential factors.
Un-transformed model
model1 <-lm(price_4_nights~prop_type_simplified+number_of_reviews+review_scores_rating, data=listings_model)
summary(model1)
##
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews +
## review_scores_rating, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54873 -4150 -2312 346 1192468
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2658.44 2594.60 1.02
## prop_type_simplifiedGuest suite -2491.34 830.46 -3.00
## prop_type_simplifiedHouse 1658.67 526.61 3.15
## prop_type_simplifiedOther 633.15 531.96 1.19
## prop_type_simplifiedVilla 49620.25 1123.39 44.17
## number_of_reviews -33.38 6.63 -5.03
## review_scores_rating 45.21 27.48 1.65
## Pr(>|t|)
## (Intercept) 0.3056
## prop_type_simplifiedGuest suite 0.0027 **
## prop_type_simplifiedHouse 0.0016 **
## prop_type_simplifiedOther 0.2340
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews 0.00000049 ***
## review_scores_rating 0.0999 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25200 on 15157 degrees of freedom
## Multiple R-squared: 0.121, Adjusted R-squared: 0.121
## F-statistic: 348 on 6 and 15157 DF, p-value: <0.0000000000000002
plot(model1)




hist(rstandard(model1))

vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02 4 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.02 1 1.01
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 232, df = 6, p-value <0.0000000000000002
outlierTest(model1)
## rstudent
## 11271 51.3
## 7629 27.0
## 14925 25.7
## 5788 24.6
## 10099 22.8
## 11630 22.4
## 7630 21.9
## 7628 21.9
## 7448 21.9
## 10829 21.9
## unadjusted p-value
## 11271 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 7629 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000126
## 14925 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007983499999999998
## 5788 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002248499999999999268915646284
## 10099 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008514699999999998887407205483270459632269445262
## 11630 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000017653999999999997139580909755029453942195577748392
## 7630 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000583649999999999923186102589075707416687641975890849565
## 7628 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000600439999999999847585238188017319190671962500168875057
## 7448 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000606859999999999865730038734441742969341264879935444477
## 10829 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000606859999999999865730038734441742969341264879935444477
## Bonferroni p
## 11271 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 7629 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000191
## 14925 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000012105999999999995
## 5788 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003409699999999998875200300255
## 10099 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000012911999999999998245735330732217302080912452105
## 11630 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000026770999999999996184072387504995849260463826623226
## 7630 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000885039999999999876096780013780954653828676189386586889
## 7628 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000910509999999999815553244863287992384541169941610996367
## 7448 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000920249999999999885916981885468414196076770051294187062
## 10829 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000920249999999999885916981885468414196076770051294187062
Initial analysis of model1 of untransformed price_4_nights against some intuitive explanatory variables: Property Type(simplified),Number of Reviews and Review Score Rating yields a rather weak model with serious violations of underlying assumptions.
The Adjusted R squared reveals that this model only explains approximately 12% of the variability in the dependent variable. It assumes Apartment property types as a base and from the summary we can see that the property being a guest suite is a significant variable and reduces the price by 2491.337, while renting a house or villa is also significant factor and increases the price in comparison to renting an apartment by 1658.674 and 49620.250 respectively on average over 4 nights for 2 people. Other property types are not significant. Number of reviews also proved to be significant ,however, the review rating does not seem to be significant at the 5% level. It’s interesting to note that more reviews equates to a lower price (-33.38 per review)whereas the quality of reviews does not seem to have a significant effect.
Plotting the standardized residuals and Normal-QQ plot reveals the non-normality of the data, we then conducted a Breusch-Pagan test (BP test) which confirmed suspicion of Heteroscedasticity, This can be seen in the residuals vs fitted plot. These violations of underlying assumptions of multiple-regression make this model flawed and thus we will no longer consider the untransformed model.It can be noted that no particular observations was beyond 0.5 Cook’s Distance however, when running an outlier test, this revealed 10 possible outliers that we may want to consider removing to correct the underlying issues. However, as there are 15164 observations we deemed that removing these 10 outliers may have little to no effect.
Log transformed model
model1t <-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data=listings_model) #running the same variables on a log transformed test
summary(model1t)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.651 -0.469 -0.078 0.400 5.563
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.839536 0.079678 98.39
## prop_type_simplifiedGuest suite -0.425330 0.025503 -16.68
## prop_type_simplifiedHouse 0.097761 0.016172 6.05
## prop_type_simplifiedOther -0.009298 0.016336 -0.57
## prop_type_simplifiedVilla 1.634476 0.034498 47.38
## number_of_reviews -0.001187 0.000204 -5.83
## review_scores_rating 0.006391 0.000844 7.57
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.000000001526682 ***
## prop_type_simplifiedOther 0.57
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews 0.000000005775807 ***
## review_scores_rating 0.000000000000038 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.774 on 15157 degrees of freedom
## Multiple R-squared: 0.16, Adjusted R-squared: 0.16
## F-statistic: 481 on 6 and 15157 DF, p-value: <0.0000000000000002
plot(model1t)




hist(rstandard(model1t))

vif(model1t)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02 4 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.02 1 1.01
bptest(model1t)
##
## studentized Breusch-Pagan test
##
## data: model1t
## BP = 767, df = 6, p-value <0.0000000000000002
Due to the non-normality issues of the underlying residuals we analyze the log transformation to the dependent variable. This seems to correct the non-normality of the residuals as shown by the increase in slope and fit of the Normal-QQ plot, and more normal looking distribution of the standard residuals histogram. We also have a higher Adjusted R squared(approx 16%) and thus conclude that the log transformed model is a better estimate of price for 4 nights. We will thus continue the analysis considering the log transformed model. However we note that the residuals vs fitted plot again reveals heteroscedasticity issues that should be corrected. The VIF test reveals no multi-collinearity issues and we once again run an outlier test to identify possible observations that could be removed in order to improve the model.
Box-Cox Model
model1_box <-lm(bcPower(price_4_nights,lambda=-0.25)~prop_type_simplified+number_of_reviews+review_scores_rating, data=listings_model) #box cox lm model on 3 variables
summary(model1_box)
##
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified +
## number_of_reviews + review_scores_rating, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3670 -0.0518 -0.0020 0.0542 0.3729
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.4358547 0.0090650 379.02
## prop_type_simplifiedGuest suite -0.0525631 0.0029015 -18.12
## prop_type_simplifiedHouse 0.0058936 0.0018399 3.20
## prop_type_simplifiedOther -0.0019794 0.0018586 -1.07
## prop_type_simplifiedVilla 0.1536879 0.0039249 39.16
## number_of_reviews -0.0001121 0.0000232 -4.84
## review_scores_rating 0.0007657 0.0000960 7.98
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.0014 **
## prop_type_simplifiedOther 0.2869
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews 0.0000013272095799 ***
## review_scores_rating 0.0000000000000016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.088 on 15157 degrees of freedom
## Multiple R-squared: 0.125, Adjusted R-squared: 0.124
## F-statistic: 360 on 6 and 15157 DF, p-value: <0.0000000000000002
plot(model1_box)




hist(rstandard(model1_box))

bptest(model1_box)
##
## studentized Breusch-Pagan test
##
## data: model1_box
## BP = 685, df = 6, p-value <0.0000000000000002
vif(model1_box)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02 4 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.02 1 1.01
outlierTest(model1_box)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 11271 4.24 0.0000226 0.343
While the box-cox transformation seemed to correct the underlying non-normality the most succesfully (see QQ-plot and histogram of residuals ),this transformation yielded a rather weak Adjusted R-squared of 12.43% meaning very little variability is being captured in this model. The log transformation seems to yield a stronger result while still maintaining normality. This model also seems to reflect heteroscedasticity issues(shown by residuals vs fitted plot and failing the BP test), however we will continue this analysis using both the box-cox and log transformed model.VIF test reveals no multi-colinearity issues and we once again run an outlier test to identify possible observations that could be removed in order to improve the model.
Considering room type as a factor:
We now proceed to consider an additional dependent variable room_type (Room Type), We hypothesize this factor to have a significant effect and that room types that are generally smaller or shared are cheaper than entire homes.
Log transformation model with room type
model2_log <-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings_model)
summary(model2_log)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.781 -0.460 -0.093 0.372 5.968
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 8.098284 0.073837 109.68
## prop_type_simplifiedGuest suite -0.281179 0.023745 -11.84
## prop_type_simplifiedHouse 0.310308 0.015535 19.98
## prop_type_simplifiedOther 0.183902 0.016143 11.39
## prop_type_simplifiedVilla 1.662733 0.031882 52.15
## number_of_reviews -0.002412 0.000190 -12.69
## review_scores_rating 0.004737 0.000781 6.07
## room_typeHotel room -0.056496 0.042113 -1.34
## room_typePrivate room -0.700470 0.014317 -48.93
## room_typeShared room -1.164642 0.069776 -16.69
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse < 0.0000000000000002 ***
## prop_type_simplifiedOther < 0.0000000000000002 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.0000000013 ***
## room_typeHotel room 0.18
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.715 on 15154 degrees of freedom
## Multiple R-squared: 0.283, Adjusted R-squared: 0.282
## F-statistic: 664 on 9 and 15154 DF, p-value: <0.0000000000000002
plot(model2_log)




hist(rstandard(model2_log))

bptest(model2_log)
##
## studentized Breusch-Pagan test
##
## data: model2_log
## BP = 389, df = 9, p-value <0.0000000000000002
vif(model2_log)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.23 4 1.03
## number_of_reviews 1.05 1 1.02
## review_scores_rating 1.02 1 1.01
## room_type 1.24 3 1.04
outlierTest(model2_log)
## rstudent unadjusted p-value Bonferroni p
## 11271 8.37 0.0000000000000000633 0.00000000000096
## 14925 8.18 0.0000000000000003111 0.00000000000472
## 5788 7.43 0.0000000000001144000 0.00000000173480
## 12609 6.69 0.0000000000235000000 0.00000035636000
## 11630 6.32 0.0000000002707100000 0.00000410500000
## 9329 5.50 0.0000000385180000000 0.00058409000000
## 10830 5.47 0.0000000448000000000 0.00067935000000
## 15017 5.33 0.0000001007200000000 0.00152730000000
## 492 5.26 0.0000001465400000000 0.00222210000000
## 4502 5.16 0.0000002488100000000 0.00377300000000
The log transformed model showed significant improvement with the addition of the room type independent variable. Adjusted R-squared significantly increased from 16% to 28%. Using room type Entire Apt and property type apartment as a base all explanatory variables are now significant at the 5 % level except for room_type Hotel. Our intercept marginally increases meaning there was an increase in average price. The addition of room types also alters the slope of the co-efficient for some of the other factors. Property types Guest suite slope reduces in magnitude while all other factors in property type appear to increase in magnitude with property type ‘Other’ becoming significant. Review score rating also becomes significant which it was not before. And while the fitted vs residuals plot also showed improvement we note that it again fails the BP test and we have not solved the heteroscedasticity issues. VIF reveals no multi-colinearity issues and we once again identify possible outliers, which we will continue to do for the remaining models.
Box cox transformation model with room type
model2_box <-lm( bcPower(price_4_nights,lambda=-0.25)~prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type, data=listings_model)
summary(model2_box)
##
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified +
## number_of_reviews + review_scores_rating + room_type, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3792 -0.0518 -0.0043 0.0497 0.4693
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.4678496 0.0082809 418.78
## prop_type_simplifiedGuest suite -0.0349367 0.0026630 -13.12
## prop_type_simplifiedHouse 0.0319065 0.0017422 18.31
## prop_type_simplifiedOther 0.0220273 0.0018104 12.17
## prop_type_simplifiedVilla 0.1571410 0.0035756 43.95
## number_of_reviews -0.0002632 0.0000213 -12.35
## review_scores_rating 0.0005601 0.0000876 6.40
## room_typeHotel room -0.0099548 0.0047230 -2.11
## room_typePrivate room -0.0855355 0.0016056 -53.27
## room_typeShared room -0.1534480 0.0078254 -19.61
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse < 0.0000000000000002 ***
## prop_type_simplifiedOther < 0.0000000000000002 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000016 ***
## room_typeHotel room 0.035 *
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0802 on 15154 degrees of freedom
## Multiple R-squared: 0.274, Adjusted R-squared: 0.273
## F-statistic: 635 on 9 and 15154 DF, p-value: <0.0000000000000002
plot(model2_box)




hist(rstandard(model2_box))

bptest(model2_box)
##
## studentized Breusch-Pagan test
##
## data: model2_box
## BP = 286, df = 9, p-value <0.0000000000000002
vif(model2_box)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.23 4 1.03
## number_of_reviews 1.05 1 1.02
## review_scores_rating 1.02 1 1.01
## room_type 1.24 3 1.04
outlierTest(model2_box)
## rstudent unadjusted p-value Bonferroni p
## 14925 5.89 0.00000000401 0.0000608
## 11271 5.27 0.00000014067 0.0021332
## 12609 5.20 0.00000019905 0.0030184
## 5788 4.99 0.00000060215 0.0091310
## 9884 -4.73 0.00000221980 0.0336620
The box-cox transformed model also showed significant improvement with the addition of the room type independent variable. Adjusted R-squared significantly increased from ~12.5% to ~27%. Using room type Entire Apt and property type apartment as a base model all explanatory variables are now significant at the 5 % level. Just as in the log model we note that the data still reflects heteroscedasticity issues. Both the log transformed and box-cox, however, seem to have maintained normality in their residuals so we will continue assessing both models by next considering if number of bathrooms, bedrooms, beds, or size of the house (accomodates) significantly predict (price_4_nights)Price for 4 nights. We can conclude our hypothesis was correct and that room type is a significant predictor in Price for 4 nights
Considering beds,Bathrooms,Bedrooms and Size Factor:
Next we consider adding beds,bathrooms,bedrooms and size(accommodates) as we believe these quantitative factors should be able to capture more of the variability in the model. ### Log model
model3_log_BBB <-lm(log(price_4_nights)~prop_type_simplified+ #adding bed bathrooms, bedrooms and size(accoommodates)
number_of_reviews+
review_scores_rating+
room_type+bathrooms+
#bedrooms+
beds+
accommodates,
data=listings_model)
summary(model3_log_BBB)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type + bathrooms + beds + accommodates,
## data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.344 -0.372 -0.069 0.304 5.796
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.466454 0.065764 113.53
## prop_type_simplifiedGuest suite -0.231529 0.020818 -11.12
## prop_type_simplifiedHouse -0.013864 0.014466 -0.96
## prop_type_simplifiedOther 0.061177 0.014245 4.29
## prop_type_simplifiedVilla 0.726406 0.031016 23.42
## number_of_reviews -0.001795 0.000166 -10.80
## review_scores_rating 0.004683 0.000688 6.81
## room_typeHotel room 0.113999 0.036863 3.09
## room_typePrivate room -0.369377 0.013638 -27.08
## room_typeShared room -0.803357 0.061913 -12.98
## bathrooms 0.152655 0.006453 23.66
## beds -0.070300 0.004423 -15.90
## accommodates 0.163130 0.003844 42.43
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.338
## prop_type_simplifiedOther 0.00001761349 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000001 ***
## room_typeHotel room 0.002 **
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.624 on 15069 degrees of freedom
## (82 observations deleted due to missingness)
## Multiple R-squared: 0.453, Adjusted R-squared: 0.453
## F-statistic: 1.04e+03 on 12 and 15069 DF, p-value: <0.0000000000000002
plot(model3_log_BBB)




hist(rstandard(model3_log_BBB))

vif(model3_log_BBB)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.67 4 1.07
## number_of_reviews 1.05 1 1.02
## review_scores_rating 1.02 1 1.01
## room_type 1.48 3 1.07
## bathrooms 2.27 1 1.51
## beds 2.71 1 1.65
## accommodates 3.04 1 1.74
bptest(model3_log_BBB)
##
## studentized Breusch-Pagan test
##
## data: model3_log_BBB
## BP = 1203, df = 12, p-value <0.0000000000000002
outlierTest(model3_log_BBB)
## rstudent unadjusted p-value Bonferroni p
## 12543 -9.58 0.00000000000000000000113 0.000000000000000017
## 11271 9.32 0.00000000000000000001327 0.000000000000000200
## 5788 9.06 0.00000000000000000014416 0.000000000000002174
## 14925 8.45 0.00000000000000003280700 0.000000000000494800
## 10091 -7.60 0.00000000000003041500000 0.000000000458720000
## 14352 -7.46 0.00000000000009162000000 0.000000001381800000
## 12609 7.42 0.00000000000012263000000 0.000000001849400000
## 10193 6.10 0.00000000107040000000000 0.000016143000000000
## 10830 5.98 0.00000000224480000000000 0.000033857000000000
## 9329 5.95 0.00000000282150000000000 0.000042553000000000
Adding Bed, Bathrooms and bedrooms and size factors to the log model seemed to dramatically improve the adjusted R-squared from ~28% to ~45%. All factors are significant at the 5% level,except for house property types having no significant effect with respect to the base (apartment property types). We continuously check the residuals plots which seem to show improvements in heteroscedasticity issues (as shown by the fitted vs residuals) we also see that listing 12543 falls outside of the 0.5 cooks distance and is an influential point that should be removed for model improvement. We also noted that bedrooms had a VIF>5 and thus showed multi-colinearity issues, we chose to remove it and run the model again which showed no multi-colinearity issues. Outliers were identified and examined.
Box-Cox model with beds,bathrooms,bedrooms and size
model3_box_BBB <-lm(bcPower(price_4_nights,lambda = -0.25)~prop_type_simplified+ #adding bed bathrooms, bedrooms and size(accoommodates)
number_of_reviews+
review_scores_rating+
room_type+bathrooms+
#bedrooms+
beds+
accommodates,
data=listings_model)
summary(model3_box_BBB)
##
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified +
## number_of_reviews + review_scores_rating + room_type + bathrooms +
## beds + accommodates, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5040 -0.0421 -0.0027 0.0411 0.4341
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.3989902 0.0074633 455.43
## prop_type_simplifiedGuest suite -0.0297488 0.0023626 -12.59
## prop_type_simplifiedHouse -0.0033588 0.0016416 -2.05
## prop_type_simplifiedOther 0.0086542 0.0016166 5.35
## prop_type_simplifiedVilla 0.0569981 0.0035199 16.19
## number_of_reviews -0.0001963 0.0000189 -10.40
## review_scores_rating 0.0005617 0.0000781 7.20
## room_typeHotel room 0.0086523 0.0041835 2.07
## room_typePrivate room -0.0490164 0.0015478 -31.67
## room_typeShared room -0.1158545 0.0070262 -16.49
## bathrooms 0.0141448 0.0007323 19.32
## beds -0.0066829 0.0005019 -13.32
## accommodates 0.0180558 0.0004363 41.39
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.041 *
## prop_type_simplifiedOther 0.00000008765063 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000000065 ***
## room_typeHotel room 0.039 *
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0708 on 15069 degrees of freedom
## (82 observations deleted due to missingness)
## Multiple R-squared: 0.432, Adjusted R-squared: 0.432
## F-statistic: 955 on 12 and 15069 DF, p-value: <0.0000000000000002
plot(model3_box_BBB)




hist(rstandard(model3_box_BBB))

bptest(model3_box_BBB)
##
## studentized Breusch-Pagan test
##
## data: model3_box_BBB
## BP = 1063, df = 12, p-value <0.0000000000000002
vif(model3_box_BBB)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.67 4 1.07
## number_of_reviews 1.05 1 1.02
## review_scores_rating 1.02 1 1.01
## room_type 1.48 3 1.07
## bathrooms 2.27 1 1.51
## beds 2.71 1 1.65
## accommodates 3.04 1 1.74
outlierTest(model3_box_BBB)
## rstudent unadjusted p-value Bonferroni p
## 12543 -7.84 0.00000000000000476 0.0000000000717
## 10091 -7.34 0.00000000000023141 0.0000000034901
## 14352 -6.58 0.00000000004895500 0.0000007383400
## 5788 6.14 0.00000000084685000 0.0000127720000
## 14925 5.81 0.00000000630930000 0.0000951570000
## 11271 5.67 0.00000001484500000 0.0002239000000
## 12609 5.66 0.00000001560200000 0.0002353100000
## 9884 -5.64 0.00000001712500000 0.0002582700000
## 10193 5.30 0.00000012041000000 0.0018160000000
## 7687 -5.11 0.00000031792000000 0.0047949000000
Adding Bed, Bathrooms and bedrooms and size factors to the box cox model seemed to also dramatically improve the adjusted r-squared from ~27% to ~43%. All factors are significant at the 5% level with underlying equation assuming apartment property type as a base. We check the residual plots again which seems to show improvement in heteroscedasticity issues(as shown by the fitted vs residuals and Scale-Location plots) we also see again that listing 12543 falls outside of the 0.5 Cook’s distance and is an influential point that should be removed for model improvement. We also noted that bedrooms had a VIF greater than 5 and thus showed multi-collinearity issues, we therefore chose to remove it and run the model again which, this time, showed no multi-collinearity issues. Outliers were identified and examined.
Considering host effects on Price:
We aim to improve our Adjusted r-squared by considering the effects that hosts have on prices.We examine if being a ‘super host’, having a high number of other listings and having their identification verified affects prices while still including previous significant factors from the previous model.
Log model including host effects factors**
model4_log_hosts <-lm(log(price_4_nights)~prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type+bathrooms+
#bedrooms+
beds+
accommodates+host_is_superhost+ #added host effects
host_total_listings_count+
host_identity_verified,
data=listings_model)
summary(model4_log_hosts)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type + bathrooms + beds + accommodates +
## host_is_superhost + host_total_listings_count + host_identity_verified,
## data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.277 -0.376 -0.059 0.313 5.813
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.414781 0.065158 113.80
## prop_type_simplifiedGuest suite -0.204123 0.020471 -9.97
## prop_type_simplifiedHouse 0.006726 0.014203 0.47
## prop_type_simplifiedOther 0.077188 0.013970 5.53
## prop_type_simplifiedVilla 0.725077 0.030441 23.82
## number_of_reviews -0.001807 0.000176 -10.24
## review_scores_rating 0.004590 0.000685 6.70
## room_typeHotel room 0.109130 0.036094 3.02
## room_typePrivate room -0.348569 0.013433 -25.95
## room_typeShared room -0.775051 0.060643 -12.78
## bathrooms 0.150597 0.006324 23.82
## beds -0.069625 0.004333 -16.07
## accommodates 0.164751 0.003765 43.76
## host_is_superhostTRUE 0.005984 0.012789 0.47
## host_total_listings_count 0.000844 0.000034 24.80
## host_identity_verifiedTRUE 0.086557 0.011840 7.31
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.6358
## prop_type_simplifiedOther 0.00000003342798 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000002202 ***
## room_typeHotel room 0.0025 **
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## host_is_superhostTRUE 0.6399
## host_total_listings_count < 0.0000000000000002 ***
## host_identity_verifiedTRUE 0.00000000000028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.611 on 15065 degrees of freedom
## (83 observations deleted due to missingness)
## Multiple R-squared: 0.476, Adjusted R-squared: 0.475
## F-statistic: 912 on 15 and 15065 DF, p-value: <0.0000000000000002
plot(model4_log_hosts)




hist(rstandard(model4_log_hosts))

bptest(model4_log_hosts)
##
## studentized Breusch-Pagan test
##
## data: model4_log_hosts
## BP = 1467, df = 15, p-value <0.0000000000000002
vif(model4_log_hosts)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.70 4 1.07
## number_of_reviews 1.23 1 1.11
## review_scores_rating 1.06 1 1.03
## room_type 1.50 3 1.07
## bathrooms 2.27 1 1.51
## beds 2.71 1 1.65
## accommodates 3.04 1 1.74
## host_is_superhost 1.26 1 1.12
## host_total_listings_count 1.02 1 1.01
## host_identity_verified 1.06 1 1.03
outlierTest(model4_log_hosts)
## rstudent unadjusted p-value Bonferroni p
## 12543 -9.66 0.000000000000000000000496 0.00000000000000000748
## 11271 9.55 0.000000000000000000001519 0.00000000000000002291
## 5788 9.29 0.000000000000000000017380 0.00000000000000026211
## 14925 8.64 0.000000000000000006036900 0.00000000000009104200
## 10091 -7.85 0.000000000000004267600000 0.00000000006436000000
## 12609 7.58 0.000000000000035536000000 0.00000000053592000000
## 14352 -7.57 0.000000000000039691000000 0.00000000059858000000
## 10193 6.20 0.000000000570840000000000 0.00000860890000000000
## 9329 6.17 0.000000000714730000000000 0.00001077900000000000
## 10830 6.17 0.000000000716870000000000 0.00001081100000000000
The inclusion of host characteristics again improved our adjusted-R squared (but not that significantly) from 45% to 47%. However both host identity verification and hosts total listings count were found to be significant at a 5% level. Being a super host had no significant effect on pricing and will thus be removed in further models.
Box cox model including host effects factors
model4_box_hosts <-lm(bcPower(price_4_nights,lambda = -0.25)~prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type+bathrooms+
#bedrooms+
beds+
accommodates+host_is_superhost+
host_total_listings_count+
host_identity_verified,
data=listings_model)
summary(model4_box_hosts)
##
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified +
## number_of_reviews + review_scores_rating + room_type + bathrooms +
## beds + accommodates + host_is_superhost + host_total_listings_count +
## host_identity_verified, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5080 -0.0424 -0.0022 0.0415 0.4361
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.39349548 0.00745510 455.19
## prop_type_simplifiedGuest suite -0.02718016 0.00234219 -11.60
## prop_type_simplifiedHouse -0.00162996 0.00162508 -1.00
## prop_type_simplifiedOther 0.01003417 0.00159836 6.28
## prop_type_simplifiedVilla 0.05681804 0.00348287 16.31
## number_of_reviews -0.00019683 0.00002019 -9.75
## review_scores_rating 0.00056181 0.00007841 7.16
## room_typeHotel room 0.00820489 0.00412965 1.99
## room_typePrivate room -0.04717045 0.00153697 -30.69
## room_typeShared room -0.11315680 0.00693844 -16.31
## bathrooms 0.01393897 0.00072350 19.27
## beds -0.00660188 0.00049572 -13.32
## accommodates 0.01818881 0.00043076 42.22
## host_is_superhostTRUE -0.00057704 0.00146321 -0.39
## host_total_listings_count 0.00007397 0.00000389 19.00
## host_identity_verifiedTRUE 0.00956931 0.00135469 7.06
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse 0.316
## prop_type_simplifiedOther 0.00000000035283 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000000082 ***
## room_typeHotel room 0.047 *
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## host_is_superhostTRUE 0.693
## host_total_listings_count < 0.0000000000000002 ***
## host_identity_verifiedTRUE 0.00000000000169 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0699 on 15065 degrees of freedom
## (83 observations deleted due to missingness)
## Multiple R-squared: 0.447, Adjusted R-squared: 0.446
## F-statistic: 811 on 15 and 15065 DF, p-value: <0.0000000000000002
plot(model4_box_hosts)




hist(rstandard(model4_box_hosts))

bptest(model4_box_hosts)
##
## studentized Breusch-Pagan test
##
## data: model4_box_hosts
## BP = 1224, df = 15, p-value <0.0000000000000002
vif(model4_box_hosts)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.70 4 1.07
## number_of_reviews 1.23 1 1.11
## review_scores_rating 1.06 1 1.03
## room_type 1.50 3 1.07
## bathrooms 2.27 1 1.51
## beds 2.71 1 1.65
## accommodates 3.04 1 1.74
## host_is_superhost 1.26 1 1.12
## host_total_listings_count 1.02 1 1.01
## host_identity_verified 1.06 1 1.03
outlierTest(model4_box_hosts)
## rstudent unadjusted p-value Bonferroni p
## 12543 -7.84 0.00000000000000467 0.0000000000705
## 10091 -7.49 0.00000000000007202 0.0000000010862
## 14352 -6.62 0.00000000003811600 0.0000005748200
## 5788 6.25 0.00000000042417000 0.0000063969000
## 14925 5.90 0.00000000371120000 0.0000559690000
## 11271 5.76 0.00000000834710000 0.0001258800000
## 12609 5.74 0.00000000992930000 0.0001497400000
## 9884 -5.67 0.00000001437100000 0.0002167300000
## 10193 5.33 0.00000010059000000 0.0015170000000
## 7687 -5.18 0.00000022576000000 0.0034047000000
The inclusion of host characteristics again improved the adjusted R-squared from ~43% to ~45%.Being a super host as in the log transformed model again has no significant effect on price while host total listings and identity verification proved to be significant. At this point we also note that the box-cox transformation yielded weak adjusted r-squared values across the 4 models considered so far. We thus conclude the log transformed model as being a better transformation and continue our analysis only examining the log transformation of the dependent variable.
Examining Location Characteristics
We next choose to examine the effects that different neighborhoods(simplified) and having the location exactly as listed has on price by including the same factors as in the previous model with the exception of being a ‘super host’, as this proved insignificant.
model5_log_loc <-lm(log(price_4_nights)~prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type+bathrooms+
#bedrooms+ bedrooms had vif>5 so removed and run the model again
beds+
accommodates+
host_total_listings_count+
host_identity_verified+
is_location_exact+neighbourhood_simplified, #adding if property location is
data=listings_model)
summary(model5_log_loc)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type + bathrooms + beds + accommodates +
## host_total_listings_count + host_identity_verified + is_location_exact +
## neighbourhood_simplified, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.859 -0.369 -0.053 0.314 5.772
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.5432855 0.0638998 118.05
## prop_type_simplifiedGuest suite -0.1523122 0.0202181 -7.53
## prop_type_simplifiedHouse 0.0297247 0.0140207 2.12
## prop_type_simplifiedOther 0.0866387 0.0137265 6.31
## prop_type_simplifiedVilla 0.7033002 0.0298822 23.54
## number_of_reviews -0.0023230 0.0001642 -14.15
## review_scores_rating 0.0041759 0.0006614 6.31
## room_typeHotel room 0.1546093 0.0355398 4.35
## room_typePrivate room -0.3152351 0.0132095 -23.86
## room_typeShared room -0.7750745 0.0594558 -13.04
## bathrooms 0.1374114 0.0062223 22.08
## beds -0.0632894 0.0042559 -14.87
## accommodates 0.1677926 0.0036935 45.43
## host_total_listings_count 0.0007920 0.0000334 23.70
## host_identity_verifiedTRUE 0.0551713 0.0116436 4.74
## is_location_exactTRUE -0.0215715 0.0123723 -1.74
## neighbourhood_simplifiedCape Flats -0.2093225 0.0253631 -8.25
## neighbourhood_simplifiedCBD -0.2676851 0.0169095 -15.83
## neighbourhood_simplifiedOuter Suburbs -0.2538057 0.0174141 -14.57
## neighbourhood_simplifiedSommerset -0.2348116 0.0265401 -8.85
## neighbourhood_simplifiedSouthern Suburbs -0.5363686 0.0370788 -14.47
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite 0.000000000000052 ***
## prop_type_simplifiedHouse 0.034 *
## prop_type_simplifiedOther 0.000000000283534 ***
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.000000000279652 ***
## room_typeHotel room 0.000013683949688 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## host_total_listings_count < 0.0000000000000002 ***
## host_identity_verifiedTRUE 0.000002174584338 ***
## is_location_exactTRUE 0.081 .
## neighbourhood_simplifiedCape Flats < 0.0000000000000002 ***
## neighbourhood_simplifiedCBD < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset < 0.0000000000000002 ***
## neighbourhood_simplifiedSouthern Suburbs < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.599 on 15060 degrees of freedom
## (83 observations deleted due to missingness)
## Multiple R-squared: 0.497, Adjusted R-squared: 0.496
## F-statistic: 743 on 20 and 15060 DF, p-value: <0.0000000000000002
plot(model5_log_loc)




hist(rstandard(model5_log_loc))

vif(model5_log_loc) #check for multi-colinearity issues
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.75 4 1.07
## number_of_reviews 1.11 1 1.05
## review_scores_rating 1.02 1 1.01
## room_type 1.52 3 1.07
## bathrooms 2.29 1 1.51
## beds 2.72 1 1.65
## accommodates 3.05 1 1.75
## host_total_listings_count 1.02 1 1.01
## host_identity_verified 1.06 1 1.03
## is_location_exact 1.03 1 1.01
## neighbourhood_simplified 1.14 5 1.01
bptest(model5_log_loc)#check for normality
##
## studentized Breusch-Pagan test
##
## data: model5_log_loc
## BP = 1348, df = 20, p-value <0.0000000000000002
outlierTest(model5_log_loc)#
## rstudent unadjusted p-value Bonferroni p
## 5788 9.68 0.00000000000000000000042 0.00000000000000000633
## 11271 9.49 0.00000000000000000000263 0.00000000000000003969
## 12543 -9.08 0.00000000000000000011774 0.00000000000000177560
## 14925 8.68 0.00000000000000000419920 0.00000000000006332900
## 10091 -8.07 0.00000000000000073131000 0.00000000001102900000
## 12609 7.80 0.00000000000000636450000 0.00000000009598300000
## 14352 -6.77 0.00000000001290600000000 0.00000019463000000000
## 10830 6.60 0.00000000004262500000000 0.00000064283000000000
## 9160 6.50 0.00000000008405800000000 0.00000126770000000000
## 9329 6.24 0.00000000044699000000000 0.00000674110000000000
The inclusion of location factors again shows improvement in the model(adj r-squared increased from ~48% to ~50%). The only insignificant factor was having an exact location, while the addition of neighborhoods proved significant. The model assumes the Atlantic seaboard neighborhood as its base factor and still assumes apartment as base room type. The residuals vs fitted plot again shows improvement in the heteroscedasticity issues, however this model still fails the BP test. Again listing 12543 shows to be influential and should be removed. We examine the plot of residuals which still seems to maintain a normal shape and VIF examination reveals no multi-collinearity issues.
Adding the effects of having a Cancellation policy and having to pay a Security deposit
We then moved on to examine the effect that paying cancellation policies and security deposits had on Price for 4 nights.We add security_deposit and cancellation_policy to our previous model and removed is_location_ exact as it proved to be an insignificant factor in the previous model.
model6_log_policy <-lm(log(price_4_nights)~
prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type+
bathrooms+
#bedrooms removed as vif>5
beds+
accommodates+
host_total_listings_count+
host_identity_verified+
neighbourhood_simplified+
security_deposit+
cancellation_policy
, #adding if property location is
data=listings_model)
summary(model6_log_policy)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type + bathrooms + beds + accommodates +
## host_total_listings_count + host_identity_verified + neighbourhood_simplified +
## security_deposit + cancellation_policy, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.055 -0.340 -0.045 0.301 5.596
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.48292885 0.07350338 101.80
## prop_type_simplifiedGuest suite -0.16001048 0.02399096 -6.67
## prop_type_simplifiedHouse 0.03306866 0.01659402 1.99
## prop_type_simplifiedOther 0.05112473 0.01566570 3.26
## prop_type_simplifiedVilla 0.42811617 0.03279836 13.05
## number_of_reviews -0.00193359 0.00018346 -10.54
## review_scores_rating 0.00439598 0.00076438 5.75
## room_typeHotel room 0.13431339 0.03655813 3.67
## room_typePrivate room -0.26830328 0.01617991 -16.58
## room_typeShared room -0.53863805 0.09665737 -5.57
## bathrooms 0.17475525 0.00864045 20.23
## beds -0.03764611 0.00512684 -7.34
## accommodates 0.11787335 0.00486724 24.22
## host_total_listings_count 0.00081838 0.00003165 25.86
## host_identity_verifiedTRUE 0.00282839 0.01303958 0.22
## neighbourhood_simplifiedCape Flats -0.19995945 0.03170062 -6.31
## neighbourhood_simplifiedCBD -0.24967636 0.02060003 -12.12
## neighbourhood_simplifiedOuter Suburbs -0.23735774 0.02055081 -11.55
## neighbourhood_simplifiedSommerset -0.20533877 0.03259501 -6.30
## neighbourhood_simplifiedSouthern Suburbs -0.61011373 0.04895935 -12.46
## security_deposit 0.00002518 0.00000119 21.15
## cancellation_policyluxury_moderate 1.57613024 0.09220257 17.09
## cancellation_policyluxury_super_strict_95 2.03987601 0.22737962 8.97
## cancellation_policymoderate -0.03145849 0.01586924 -1.98
## cancellation_policystrict_14_with_grace_period 0.13633584 0.01440133 9.47
## cancellation_policysuper_strict_30 0.59608182 0.03460200 17.23
## cancellation_policysuper_strict_60 -0.22029215 0.10537916 -2.09
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite 0.00000000002703 ***
## prop_type_simplifiedHouse 0.04631 *
## prop_type_simplifiedOther 0.00110 **
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews < 0.0000000000000002 ***
## review_scores_rating 0.00000000913457 ***
## room_typeHotel room 0.00024 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room 0.00000002575127 ***
## bathrooms < 0.0000000000000002 ***
## beds 0.00000000000023 ***
## accommodates < 0.0000000000000002 ***
## host_total_listings_count < 0.0000000000000002 ***
## host_identity_verifiedTRUE 0.82828
## neighbourhood_simplifiedCape Flats 0.00000000029534 ***
## neighbourhood_simplifiedCBD < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset 0.00000000031101 ***
## neighbourhood_simplifiedSouthern Suburbs < 0.0000000000000002 ***
## security_deposit < 0.0000000000000002 ***
## cancellation_policyluxury_moderate < 0.0000000000000002 ***
## cancellation_policyluxury_super_strict_95 < 0.0000000000000002 ***
## cancellation_policymoderate 0.04747 *
## cancellation_policystrict_14_with_grace_period < 0.0000000000000002 ***
## cancellation_policysuper_strict_30 < 0.0000000000000002 ***
## cancellation_policysuper_strict_60 0.03660 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.551 on 9818 degrees of freedom
## (5319 observations deleted due to missingness)
## Multiple R-squared: 0.576, Adjusted R-squared: 0.575
## F-statistic: 513 on 26 and 9818 DF, p-value: <0.0000000000000002
plot(model6_log_policy)




hist(rstandard(model6_log_policy))

bptest(model6_log_policy)
##
## studentized Breusch-Pagan test
##
## data: model6_log_policy
## BP = 578, df = 26, p-value <0.0000000000000002
vif(model6_log_policy)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 2.09 4 1.10
## number_of_reviews 1.14 1 1.07
## review_scores_rating 1.02 1 1.01
## room_type 1.54 3 1.07
## bathrooms 2.86 1 1.69
## beds 2.80 1 1.67
## accommodates 4.33 1 2.08
## host_total_listings_count 1.03 1 1.02
## host_identity_verified 1.07 1 1.04
## neighbourhood_simplified 1.16 5 1.02
## security_deposit 1.18 1 1.09
## cancellation_policy 1.25 6 1.02
outlierTest(model6_log_policy)
## rstudent unadjusted p-value Bonferroni p
## 11271 10.21 0.0000000000000000000000024 0.0000000000000000000237
## 5788 8.48 0.0000000000000000256780000 0.0000000000002528000000
## 14925 8.46 0.0000000000000000305330000 0.0000000000003005900000
## 10830 7.26 0.0000000000004286400000000 0.0000000042200000000000
## 14641 6.75 0.0000000000157050000000000 0.0000001546200000000000
## 9329 6.65 0.0000000000302060000000000 0.0000002973700000000000
## 11578 5.75 0.0000000093548000000000008 0.0000920979999999999981
## 13345 -5.58 0.0000000242190000000000000 0.0002384400000000000007
## 10091 -5.57 0.0000000258959999999999991 0.0002549399999999999952
## 11579 5.39 0.0000000736369999999999996 0.0007249599999999999708
The inclusion of a cancellation policy and security deposit both seemed to be significant in determining the price as they both have significant p values. The Adjusted R-squared also improved to ~58% (from a previous 50%). This seems to be a fairly good model for predicting price for 4 nights. We examine the residuals and notice great improvement in meeting the underlying assumptions of Multiple- Regression. The fitted vs residual plot is flat and shows a more random nature in comparison to that of our initial model, however, the small p value of the BP test indicates the null hypothesis of homoscedasticity is rejected and the model still has heteroscedasticity issues. The examination of residuals histogram again reveals a far more normal underlying distribution to what we initially started with and that the log transformation was effective in correcting for non-normality. VIF reveals no multi-collinearity issues and the outlier test again reveals possible observations that should be removed for model improvement.
Trying to estimate a better model
While the previous model did prove to be promising, we elected to run an even more extensive model examining all possible factors that we believed could be significant on price. This included factors we had considered previously as well as the introduction of more specific factors such as reviews for value. The original factors were as follows :prop_type_simplified,review_scores_rating,room_type,bathrooms,bedrooms,beds,accommodation, host_total_listings_count, host_identity_verified, host_response_rate, neighborhood_simplified,security_deposit, cancellation_policy,instant_bookable, review_scores_cleanliness, review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
price_4_nights_full <-lm(log(price_4_nights)~prop_type_simplified+ #examine effects of security deposit and cancellation policy
number_of_reviews+
review_scores_rating+
room_type+
bathrooms+
#bedrooms+ vif > 5 so removed
beds+
#accommodates+ vif>5
#host_total_listings_count+ #removed if location is exact as not significant
#host_identity_verified+
#host_response_rate+
neighbourhood_simplified+
security_deposit+
cancellation_policy+
#instant_bookable+
#review_scores_cleanliness+ #proved to be insignificant when run
#review_scores_checkin+
#review_scores_communication+
review_scores_location+
review_scores_value
,
data=listings_model)
summary(price_4_nights_full)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type + bathrooms + beds + neighbourhood_simplified +
## security_deposit + cancellation_policy + review_scores_location +
## review_scores_value, data = listings_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.287 -0.307 -0.035 0.285 3.885
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.34841258 0.08988197 81.76
## prop_type_simplifiedGuest suite -0.18620047 0.02373552 -7.84
## prop_type_simplifiedHouse 0.08167630 0.01705393 4.79
## prop_type_simplifiedOther 0.01841417 0.01629029 1.13
## prop_type_simplifiedVilla 0.39508295 0.03425863 11.53
## number_of_reviews -0.00127227 0.00017436 -7.30
## review_scores_rating 0.00965609 0.00107531 8.98
## room_typeHotel room 0.08205166 0.04640275 1.77
## room_typePrivate room -0.43907866 0.01791673 -24.51
## room_typeShared room -0.89066226 0.12707778 -7.01
## bathrooms 0.27686197 0.00839275 32.99
## beds 0.04882950 0.00491038 9.94
## neighbourhood_simplifiedCape Flats -0.23610402 0.03528431 -6.69
## neighbourhood_simplifiedCBD -0.24505076 0.02281454 -10.74
## neighbourhood_simplifiedOuter Suburbs -0.20660367 0.02185832 -9.45
## neighbourhood_simplifiedSommerset -0.17634329 0.03584058 -4.92
## neighbourhood_simplifiedSouthern Suburbs -0.37212657 0.06224877 -5.98
## security_deposit 0.00002603 0.00000142 18.38
## cancellation_policymoderate 0.03118345 0.01619578 1.93
## cancellation_policystrict_14_with_grace_period 0.12999877 0.01519273 8.56
## cancellation_policysuper_strict_30 0.53131804 0.04957303 10.72
## cancellation_policysuper_strict_60 -0.26249969 0.11155302 -2.35
## review_scores_location 0.08040592 0.01020453 7.88
## review_scores_value -0.11625998 0.00992696 -11.71
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite 0.0000000000000049 ***
## prop_type_simplifiedHouse 0.0000017048391570 ***
## prop_type_simplifiedOther 0.258
## prop_type_simplifiedVilla < 0.0000000000000002 ***
## number_of_reviews 0.0000000000003241 ***
## review_scores_rating < 0.0000000000000002 ***
## room_typeHotel room 0.077 .
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room 0.0000000000026063 ***
## bathrooms < 0.0000000000000002 ***
## beds < 0.0000000000000002 ***
## neighbourhood_simplifiedCape Flats 0.0000000000236357 ***
## neighbourhood_simplifiedCBD < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset 0.0000008823689233 ***
## neighbourhood_simplifiedSouthern Suburbs 0.0000000023585091 ***
## security_deposit < 0.0000000000000002 ***
## cancellation_policymoderate 0.054 .
## cancellation_policystrict_14_with_grace_period < 0.0000000000000002 ***
## cancellation_policysuper_strict_30 < 0.0000000000000002 ***
## cancellation_policysuper_strict_60 0.019 *
## review_scores_location 0.0000000000000037 ***
## review_scores_value < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.506 on 7727 degrees of freedom
## (7413 observations deleted due to missingness)
## Multiple R-squared: 0.552, Adjusted R-squared: 0.551
## F-statistic: 414 on 23 and 7727 DF, p-value: <0.0000000000000002
plot(price_4_nights_full)




hist(rstandard(price_4_nights_full))

vif(price_4_nights_full)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.81 4 1.08
## number_of_reviews 1.08 1 1.04
## review_scores_rating 2.38 1 1.54
## room_type 1.37 3 1.05
## bathrooms 2.31 1 1.52
## beds 2.08 1 1.44
## neighbourhood_simplified 1.13 5 1.01
## security_deposit 1.17 1 1.08
## cancellation_policy 1.09 4 1.01
## review_scores_location 1.46 1 1.21
## review_scores_value 2.40 1 1.55
outlierTest(price_4_nights_full)
## rstudent unadjusted p-value Bonferroni p
## 5212 -8.75 0.00000000000000000268 0.0000000000000207
## 10830 7.73 0.00000000000001223700 0.0000000000948480
## 9329 7.13 0.00000000000112090000 0.0000000086878000
## 5009 -6.30 0.00000000031892000000 0.0000024720000000
## 10725 -6.03 0.00000000175070000000 0.0000135690000000
## 8231 5.54 0.00000003047600000000 0.0002362200000000
## 11376 -5.45 0.00000005210700000000 0.0004038800000000
## 7588 -5.21 0.00000018891000000000 0.0014642000000000
## 14701 -5.04 0.00000048433000000000 0.0037541000000000
## 14868 5.04 0.00000048494000000000 0.0037588000000000
# check normality of residuals
Running this model revealed a very good ajusted R-squared of ~60%, however, we noted multi-collinearity issues with accommodation(size) and bedrooms both having a vif>5, we thus removed them and ran the model again. We noted that review scores for cleanliness, communication and check in were not significant and that host related factors(host_total_listings_count,host_identity_verified,host_response_rate) and being instantly bookable were not significant either, so we removed these factors and run the regression again. This reduced the value of the Adjusted R-squared to ~55% which is less than the prior model.
Comparison of models
library(huxtable)
kbl(huxreg(model1,model1t,model1_box,model2_log,model2_box,model3_log_BBB,model3_box_BBB,model4_log_hosts,
model4_box_hosts,model5_log_loc,model6_log_policy,price_4_nights_full))
| names | model1 | model2 | model3 | model4 | model5 | model6 | model7 | model8 | model9 | model10 | model11 | model12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
| 1 | (Intercept) | 2658.4405953067 | 7.83953624691064 *** | 3.43585468456551 *** | 8.098283963137 *** | 3.46784955276225 *** | 7.46645420635465 *** | 3.39899017641452 *** | 7.41478139492102 *** | 3.39349548073157 *** | 7.54328548954538 *** | 7.48292884644085 *** | 7.34841258365636 *** |
| 2 | (2594.60195797835) | (0.0796778048139288) | (0.00906504154132443) | (0.0738372371605012) | (0.00828087401406495) | (0.0657644515527516) | (0.00746331015229588) | (0.0651583705520031) | (0.00745509722289963) | (0.0638998099374701) | (0.0735033822899816) | (0.0898819735320436) | |
| 3 | prop_type_simplifiedGuest suite | -2491.33724008497 ** | -0.425330419773082 *** | -0.0525630774751917 *** | -0.281179331346766 *** | -0.0349367202661454 *** | -0.231529190040081 *** | -0.0297487566697921 *** | -0.204122563152797 *** | -0.0271801584422914 *** | -0.152312181187604 *** | -0.160010481333662 *** | -0.18620046810099 *** |
| 4 | (830.46106947796) | (0.0255026844468232) | (0.00290147167665601) | (0.0237449814428094) | (0.00266300863027687) | (0.0208183476601531) | (0.00236258011399085) | (0.0204710137076261) | (0.00234219174219315) | (0.0202180582677605) | (0.0239909593846555) | (0.0237355219849422) | |
| 5 | prop_type_simplifiedHouse | 1658.67400770658 ** | 0.0977612146241578 *** | 0.00589362891853956 ** | 0.310307758716861 *** | 0.0319064666225461 *** | -0.0138638733042324 | -0.00335880430850463 * | 0.00672580987886029 | -0.00162995600799047 | 0.0297247256952743 * | 0.0330686628492796 * | 0.0816762951971237 *** |
| 6 | (526.605032195879) | (0.0161715491042137) | (0.00183985696844447) | (0.0155347540969922) | (0.00174222853486571) | (0.0144656908920247) | (0.00164164583061751) | (0.0142033892162826) | (0.00162508126899156) | (0.0140206782814926) | (0.0165940169470332) | (0.0170539268889209) | |
| 7 | prop_type_simplifiedOther | 633.147275812926 | -0.00929837023603162 | -0.00197944730983314 | 0.183902464626499 *** | 0.0220272921858152 *** | 0.0611767355421061 *** | 0.00865417978137493 *** | 0.0771883231865824 *** | 0.010034169176559 *** | 0.0866387479649601 *** | 0.0511247258787441 ** | 0.0184141713900181 |
| 8 | (531.962503667527) | (0.0163360720534474) | (0.00185857494608941) | (0.0161425207974367) | (0.00181038979969453) | (0.014245166557147) | (0.00161661952128988) | (0.0139698495284163) | (0.00159836081751772) | (0.0137264783020154) | (0.0156656982228964) | (0.0162902861231909) | |
| 9 | prop_type_simplifiedVilla | 49620.2503668972 *** | 1.63447552800511 *** | 0.153687864099452 *** | 1.66273257694416 *** | 0.157140997627402 *** | 0.726406004461657 *** | 0.0569980603232139 *** | 0.725076509688456 *** | 0.0568180410646431 *** | 0.703300186882409 *** | 0.428116172310518 *** | 0.395082950885861 *** |
| 10 | (1123.39373747912) | (0.0344983733126429) | (0.00392492222793463) | (0.0318818895411535) | (0.00357556594441291) | (0.0310163985133541) | (0.00351991077925571) | (0.0304406778403017) | (0.00348287120914565) | (0.0298822267230871) | (0.0327983645531646) | (0.0342586295903378) | |
| 11 | number_of_reviews | -33.3801167771904 *** | -0.00118673633535017 *** | -0.000112100837401775 *** | -0.00241185097492915 *** | -0.000263233976907366 *** | -0.00179495350681289 *** | -0.000196252018407267 *** | -0.00180694439679867 *** | -0.000196831401711996 *** | -0.00232295023785362 *** | -0.00193358954575242 *** | -0.0012722745264221 *** |
| 12 | (6.63256562374388) | (0.000203679900710476) | (0.0000231729120221763) | (0.000190071725502894) | (0.0000213166157490977) | (0.000166276092029035) | (0.0000188699216130307) | (0.000176435586496053) | (0.0000201868837382546) | (0.000164156457193156) | (0.000183458044900662) | (0.000174362305990784) | |
| 13 | review_scores_rating | 45.2149276439987 | 0.00639120522839432 *** | 0.000765672155446871 *** | 0.00473701323057481 *** | 0.000560090845174015 *** | 0.00468252128793722 *** | 0.000561700199474217 *** | 0.00458976383267998 *** | 0.000561808811247495 *** | 0.00417586503762622 *** | 0.00439597989279097 *** | 0.00965608819757964 *** |
| 14 | (27.4792037337613) | (0.000843860702720971) | (0.0000960070667499058) | (0.00078071084825284) | (0.0000875570162754403) | (0.000687845936668773) | (0.0000780605242064202) | (0.00068533851631602) | (0.0000784130300750231) | (0.000661372787816754) | (0.000764379578568218) | (0.0010753050494551) | |
| 15 | room_typeHotel room | -0.0564963545424016 | -0.00995479369396976 * | 0.11399932423255 ** | 0.00865225364839856 * | 0.109129934808675 ** | 0.00820488948188399 * | 0.154609339613853 *** | 0.134313386970929 *** | 0.0820516572689082 | |||
| 16 | (0.0421129564893995) | (0.00472298396391067) | (0.0368634086461288) | (0.00418346151303642) | (0.0360936213485086) | (0.00412965293637757) | (0.0355398438394477) | (0.0365581272724676) | (0.0464027461451645) | ||||
| 17 | room_typePrivate room | -0.700469980255449 *** | -0.0855355450364339 *** | -0.369377156554424 *** | -0.0490163544616129 *** | -0.348569124958902 *** | -0.0471704478853915 *** | -0.315235118254419 *** | -0.268303284784049 *** | -0.439078661524376 *** | |||
| 18 | (0.0143165311649329) | (0.00160560437327229) | (0.0136383919977057) | (0.00154775942099694) | (0.0134332580494336) | (0.00153696668486975) | (0.0132094856694054) | (0.0161799097702766) | (0.0179167309673096) | ||||
| 19 | room_typeShared room | -1.16464217584111 *** | -0.153447959401989 *** | -0.803356553518357 *** | -0.115854498439462 *** | -0.775050740958292 *** | -0.113156799767381 *** | -0.775074523559198 *** | -0.538638045556679 *** | -0.890662262466139 *** | |||
| 20 | (0.069775713737354) | (0.00782537262932127) | (0.0619125618518909) | (0.00702617661234841) | (0.0606427605205365) | (0.00693844354479059) | (0.0594558034128224) | (0.0966573653812291) | (0.127077775104405) | ||||
| 21 | bathrooms | 0.152654748561735 *** | 0.0141447554016876 *** | 0.150596744376025 *** | 0.0139389724371136 *** | 0.137411391075311 *** | 0.174755245747438 *** | 0.276861974148033 *** | |||||
| 22 | (0.00645252729903706) | (0.000732268138209002) | (0.0063234736344457) | (0.000723500454843494) | (0.00622225881717983) | (0.00864045001425425) | (0.00839275495171724) | ||||||
| 23 | beds | -0.0702999384735502 *** | -0.0066828848743994 *** | -0.0696247715340496 *** | -0.00660187817408571 *** | -0.0632893933434176 *** | -0.0376461085107749 *** | 0.0488294961970899 *** | |||||
| 24 | (0.00442263834245379) | (0.000501905221770033) | (0.00433266872204997) | (0.000495722441857562) | (0.00425594926531817) | (0.00512683529428104) | (0.00491038345493) | ||||||
| 25 | accommodates | 0.163129874859234 *** | 0.018055752323278 *** | 0.164750933980334 *** | 0.0181888077785848 *** | 0.167792592105737 *** | 0.117873350480238 *** | ||||||
| 26 | (0.0038443260839278) | (0.000436275179272194) | (0.00376489298847916) | (0.000430760361641094) | (0.00369346793576996) | (0.00486723541125897) | |||||||
| 27 | host_is_superhostTRUE | 0.00598361768086492 | -0.000577043114074525 | ||||||||||
| 28 | (0.0127886223848374) | (0.00146321067298363) | |||||||||||
| 29 | host_total_listings_count | 0.000844020509133167 *** | 0.0000739708223228305 *** | 0.000791974145422668 *** | 0.0008183823076365 *** | ||||||||
| 30 | (0.0000340342536170638) | (0.00000389403030607606) | (0.0000334196257433734) | (0.0000316478130844066) | |||||||||
| 31 | host_identity_verifiedTRUE | 0.086556763268096 *** | 0.00956931116890247 *** | 0.055171264816713 *** | 0.00282839290581888 | ||||||||
| 32 | (0.0118401432052085) | (0.00135469039480408) | (0.0116436177552824) | (0.0130395755337533) | |||||||||
| 33 | is_location_exactTRUE | -0.0215715225897326 | |||||||||||
| 34 | (0.0123723157834674) | ||||||||||||
| 35 | neighbourhood_simplifiedCape Flats | -0.20932251474412 *** | -0.199959446561218 *** | -0.23610402431263 *** | |||||||||
| 36 | (0.0253630523138711) | (0.0317006245466154) | (0.0352843100357869) | ||||||||||
| 37 | neighbourhood_simplifiedCBD | -0.267685079246393 *** | -0.249676361299245 *** | -0.245050758564061 *** | |||||||||
| 38 | (0.0169095412834666) | (0.0206000256489285) | (0.0228145449511962) | ||||||||||
| 39 | neighbourhood_simplifiedOuter Suburbs | -0.253805711866448 *** | -0.237357737864287 *** | -0.206603670659701 *** | |||||||||
| 40 | (0.0174141080985606) | (0.0205508061534903) | (0.0218583221468348) | ||||||||||
| 41 | neighbourhood_simplifiedSommerset | -0.234811556815801 *** | -0.205338769971492 *** | -0.176343289916416 *** | |||||||||
| 42 | (0.0265401221782711) | (0.0325950141372058) | (0.0358405755289053) | ||||||||||
| 43 | neighbourhood_simplifiedSouthern Suburbs | -0.536368612330183 *** | -0.610113731173444 *** | -0.372126566839667 *** | |||||||||
| 44 | (0.037078791529449) | (0.0489593522993336) | (0.0622487661123216) | ||||||||||
| 45 | security_deposit | 0.0000251828605729947 *** | 0.0000260284807343289 *** | ||||||||||
| 46 | (0.00000119095316353771) | (0.00000141627288372506) | |||||||||||
| 47 | cancellation_policyluxury_moderate | 1.57613024491602 *** | |||||||||||
| 48 | (0.0922025677212688) | ||||||||||||
| 49 | cancellation_policyluxury_super_strict_95 | 2.03987601101046 *** | |||||||||||
| 50 | (0.227379624341317) | ||||||||||||
| 51 | cancellation_policymoderate | -0.0314584877915848 * | 0.0311834514222431 | ||||||||||
| 52 | (0.0158692429990555) | (0.0161957795060771) | |||||||||||
| 53 | cancellation_policystrict_14_with_grace_period | 0.136335843026511 *** | 0.129998774372139 *** | ||||||||||
| 54 | (0.0144013272748752) | (0.0151927340619825) | |||||||||||
| 55 | cancellation_policysuper_strict_30 | 0.596081818069027 *** | 0.531318037554978 *** | ||||||||||
| 56 | (0.0346020027403708) | (0.0495730321868159) | |||||||||||
| 57 | cancellation_policysuper_strict_60 | -0.220292151745535 * | -0.262499688177884 * | ||||||||||
| 58 | (0.105379155889676) | (0.111553020887095) | |||||||||||
| 59 | review_scores_location | 0.0804059206785202 *** | |||||||||||
| 60 | (0.0102045324441997) | ||||||||||||
| 61 | review_scores_value | -0.116259979160915 *** | |||||||||||
| 62 | (0.0099269609371801) | ||||||||||||
| 1.1 | N | 15164 | 15164 | 15164 | 15164 | 15164 | 15082 | 15082 | 15081 | 15081 | 15081 | 9845 | 7751 |
| 2.1 | R2 | 0.120999699699964 | 0.159838544045043 | 0.124647365151497 | 0.282813121926888 | 0.27391205376706 | 0.45302849453 | 0.432011276527834 | 0.475877032312182 | 0.446782403347269 | 0.496519241671087 | 0.576083444143207 | 0.552239556605096 |
| 3.1 | logLik | -175191.706979002 | -17623.301572395 | 15336.6477329464 | -16423.3921662058 | 16754.1456295263 | -14279.7539953668 | 18539.8906945251 | -13957.5184613583 | 18736.925091094 | -13654.5348950004 | -8096.00916312769 | -5703.4576481468 |
| 4.1 | AIC | 350399.413958003 | 35262.6031447901 | -30657.2954658927 | 32868.7843324115 | -33486.2912590526 | 28587.5079907336 | -37051.7813890502 | 27949.0369227166 | -37439.850182188 | 27353.0697900008 | 16248.0183262554 | 11456.9152962936 |
| .1 | *** p < 0.001; ** p < 0.01; * p < 0.05. |
#huxreg fits a maximum of 8 observations at a time so we ran two separate ones to compare all models analyzed
#price_4_nights_full was best at 0.58 A-R^2
Comparing all models we see that the highest R squared is given by our second-last model (model6_log_policy), which reads as follows:log(price_4_nights) = 7.483 -1.600e-01 prop_type_simplifiedGuest suite + 3.307e-02 prop_type_simplifiedHouse + 5.112e-02 rop_type_simplifiedOther + 4.281e-01 prop_type_simplifiedVilla -1.934e-03 number_of_reviews + 4.396e-03review_scores_rating + 1.343e-01 room_typeHotel room -2.683e-01 room_typePrivate room -5.386e-01room_typeShared room + 1.748e-01 bathrooms -3.765e-02 beds + 1.179e-01 accommodates + 8.184e-04 host_total_listings_count -2.000e-01 neighbourhood_simplifiedCape Flats -2.497e-01 neighbourhood_simplifiedCBD -2.374e-01 neighbourhood_simplifiedOuter Suburbs -2.053e-01 neighbourhood_simplifiedSommerset -6.101e-01 neighbourhood_simplifiedSouthern Suburbs + 2.518e-05 security_deposit + 1.576e+00 cancellation_policyluxury_moderate + 2.040e+00 cancellation_policyluxury_super_strict_95 -3.146e-02 cancellation_policymoderate + 1.363e-01 cancellation_policystrict_14_with_grace_period + 5.961e-01 cancellation_policysuper_strict_30 -2.203e-01 cancellation_policysuper_strict_60+ error
Where the model assumes apartment property types, Entire home/apt property type, Atlantic Seaboard neighborhood, and no(N/A)cancellation policy as its base hence all estimates of factor slope either correlated to increases or decreases on these base assumptions.
We proceed to run an ANOVA test and find the confidence intervals for our estimated model.The ANOVA reveals that there are significant differences between means of the groups as the p value is less than 0.05
library(kableExtra)
kbl(anova(model6_log_policy))
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| prop_type_simplified | 4 | 1425.49 | 356.374 | 1171.9 | 0 |
| number_of_reviews | 1 | 22.38 | 22.383 | 73.6 | 0 |
| review_scores_rating | 1 | 19.17 | 19.172 | 63.0 | 0 |
| room_type | 3 | 592.17 | 197.391 | 649.1 | 0 |
| bathrooms | 1 | 1044.84 | 1044.844 | 3436.0 | 0 |
| beds | 1 | 4.26 | 4.261 | 14.0 | 0 |
| accommodates | 1 | 188.46 | 188.462 | 619.8 | 0 |
| host_total_listings_count | 1 | 219.63 | 219.631 | 722.3 | 0 |
| host_identity_verified | 1 | 8.63 | 8.626 | 28.4 | 0 |
| neighbourhood_simplified | 5 | 145.49 | 29.097 | 95.7 | 0 |
| security_deposit | 1 | 157.07 | 157.070 | 516.5 | 0 |
| cancellation_policy | 6 | 229.61 | 38.269 | 125.8 | 0 |
| Residuals | 9818 | 2985.54 | 0.304 | NA | NA |
kbl(confint(model6_log_policy))
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 7.339 | 7.627 |
| prop_type_simplifiedGuest suite | -0.207 | -0.113 |
| prop_type_simplifiedHouse | 0.001 | 0.066 |
| prop_type_simplifiedOther | 0.020 | 0.082 |
| prop_type_simplifiedVilla | 0.364 | 0.492 |
| number_of_reviews | -0.002 | -0.002 |
| review_scores_rating | 0.003 | 0.006 |
| room_typeHotel room | 0.063 | 0.206 |
| room_typePrivate room | -0.300 | -0.237 |
| room_typeShared room | -0.728 | -0.349 |
| bathrooms | 0.158 | 0.192 |
| beds | -0.048 | -0.028 |
| accommodates | 0.108 | 0.127 |
| host_total_listings_count | 0.001 | 0.001 |
| host_identity_verifiedTRUE | -0.023 | 0.028 |
| neighbourhood_simplifiedCape Flats | -0.262 | -0.138 |
| neighbourhood_simplifiedCBD | -0.290 | -0.209 |
| neighbourhood_simplifiedOuter Suburbs | -0.278 | -0.197 |
| neighbourhood_simplifiedSommerset | -0.269 | -0.141 |
| neighbourhood_simplifiedSouthern Suburbs | -0.706 | -0.514 |
| security_deposit | 0.000 | 0.000 |
| cancellation_policyluxury_moderate | 1.395 | 1.757 |
| cancellation_policyluxury_super_strict_95 | 1.594 | 2.486 |
| cancellation_policymoderate | -0.063 | 0.000 |
| cancellation_policystrict_14_with_grace_period | 0.108 | 0.165 |
| cancellation_policysuper_strict_30 | 0.528 | 0.664 |
| cancellation_policysuper_strict_60 | -0.427 | -0.014 |
Running the model on the test data
We attempted to run the test data against the predicted values and analyze the Root Mean Squared Error, however this proved problematic as large portions of the test data contain N/A variables which are obscuring the prediction. We would thus need to clean and rename N/A values across all possible variables.This proves difficult as we can not impute missing values with the mean or median as we risk creating unnecessary bias in our model.
rmse_model <- listings_model%>%
mutate(predictions=predict(model6_log_policy,.))%>%
summarize(sqrt(sum(exp(predictions)-price_4_nights)**2/n()))%>%
pull()
rmse_model
## [1] NA
#rmse_test <- listing_test%>%
#mutate(predictions=predict(model6_log_policy,.))%>%
#summarize(sqrt(sum(exp(predictions)-price_4_nights)**2/n()))%>%
#pull()
#rmse_test
Prediction of Price
After a very challenging Applied Statistics with R module, two of us are planning to visit Cape Town for 4 nights over reading week, and we want to stay in an Airbnb. Find Airbnb’s that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights. We have also elected for a rental that has 2 beds, 1 bathroom and accommodates 2.We are not interested in the number of host listing counts and would like to not pay a security deposit or cancellation fee as we are students. After consulting our Cape Town expert (Jason) we would like to stay in the Atlantic Seaboard neighborhood as he advises it has the best views and shops.
predict = 7.483 -0.001934*(10) + 0.004396*(90) -0.2683*(1)+ 0.1748*(1) -0.03765*(2) + 0.1179*(2) # substituted criteria into equation deleted all values that did not apply or will null value like zero security deposit
Price= exp(predict)
Price
## [1] 2769
exp(0.5514)
## [1] 1.74
# SE for confidence intervals is 0.5514
upper= exp(predict+0.5514*1.96)
lower=exp(predict-0.5514*1.96)
ci= merge(upper,lower)#sorry Kostis for the merge ;)
names(ci)[1] = "97.5%"
names(ci)[2] = "2.5%"
kbl(ci)
| 97.5% | 2.5% |
|---|---|
| 8160 | 940 |
The model estimates that it will cost $2769.17(Although we believe currencies are actually reported in ZAR but recorded in dollars) with a 95% confidence interval [939.69; 8160.36]. This is a large interval considering the sample size however it is not that unexpected as the inverse log value for standard error is
Suggestion for further work
There are some methods that can be implemented to improve the quality of this model. Firstly, the method of obtaining data. There are multitudes of missing values across the entire data set as well as many incorrect observation in variables such as a property that was listed in the USA but appeared on this data set.This brings suspicion on the quality and reliability of data supplied by AirBnB. One significant problem we mentioned was heteroscedasticity, which we didn’t manage to remove from our final model. This could have been solved by removing outliers and influential points identified for the model, or examining a different transformation than the ones we did.
Another issue comes from the fact that the actual property prices are decided by the hosts themselves, not based on any predefined criteria i.e hosts just set the price they want. This can have a large impact on the outcome of this analysis, as things like pre-definition can influence the overall outcome.
Further analysis includes obtaining information on possible influential factors that are not included in this dataset. Such as whether the property is new or old, whether there is an elevator/lift and, as crime is a major issue in South Africa, whether some sort of security is included in a particular listing.